Synthesis and Optimization of Reversible Circuits - A Survey 

MEHDI SAEEDI, Amirkabir University of Technology 
IGOR L. MARKOV, University of Michigan 



Reversible logic circuits have been historically motivated by theoretical research in low-power electronics 
as well as practical improvement of bit-manipulation transforms in cryptography and computer graphics. 
Recently, reversible circuits have attracted interest as components of quantum algorithms, as well as in pho- 
tonic and nano-computing technologies where some switching devices offer no signal gain. Research in gen- 
erating reversible logic distinguishes between circuit synthesis, post-synthesis optimization, and technology 
mapping. In this survey, we review algorithmic paradigms — search-based, cycle-based, transformation- 
based, and BDD-based — as well as specific algorithms for reversible synthesis, both exact and heuristic. 
We conclude the survey by outlining key open challenges in synthesis of reversible and quantum logic, as 
well as most common misconceptions. 

1. INTRODUCTION 

A computation is reversible if it can be 'undone' in the sense that the output contains 
sufficient information to reconstruct the input, i.e., no input information is erased
[Toffoli 1980]. It is also common to require that no information is duplicated. In Computer
Science, reversible transformations have been popularized by the Rubik's cube and 
sliding-tile puzzles, which fueled the development of new algorithms, such as iterative- 
deepening A*-search [Korf 1999]. Prior to that, reversible computing was proposed to 
minimize energy loss due to the erasure and duplication of information. Today, re- 
versible information processing draws motivation from several sources. 

• Considerations of power consumption prompted research on reversible computation,
historically. In 1949, John von Neumann estimated the minimum possible
energy dissipation per bit as $k_B T \ln 2$, where $k_B \approx 1.38065 \times 10^{-23}$ J/K is the Boltzmann
constant and $T$ is the temperature of the environment [Von Neumann 1966]. Subsequently,
Landauer [1961] pointed out that the irreversible erasure of a bit of information
consumes power and dissipates heat. While reversible designs avoid this
aspect of power dissipation, most power consumed by modern circuits is unrelated 
to computation but is due to clock networks, power and ground networks, wires, 
repeaters, and memory. A recent trend in low-power electronics is to replace logic 
reversibility by charge recovery, e.g., through dual-rail encoding where the 01 combination
represents a logical 0 and 10 represents a logical 1 [Kim et al. 2005].^1

• Signal processing, cryptography, and computer graphics often require reversible
transforms, where all of the information encoded in the input must be preserved
in the output. A common example is swapping two values a and b without intermediate
storage by using bitwise XOR operations $a = a \oplus b$, $b = a \oplus b$, $a = a \oplus b$. Given
that reversible transformations appear in bottlenecks of commonly-used algorithms, 
new instructions have been added to the instruction sets of various microprocessors 
such as vperm in PowerPC AltiVec, bshuffle in Sun SPARC VIS, permute and mix in
HP PA-RISC, pshufb in Intel IA32 and mux in Intel IA64 to improve their performance



^1 While charge recovery is reminiscent of conservative logic [Fredkin and Toffoli 1982], its essential property is to
avoid dissipating electric charges by exchanging them. This property requires transistor-level support and 
is not specific to logic circuits as it also applies to clock networks. 
[McGregor and Lee 2003]. In particular, the performance of cryptographic algorithms
DES, Twofish and Serpent, as well as string reversals and matrix transpositions, can 
be considerably improved by the addition of bit-permutation instructions [Shi and 
Lee 2000; Hilewitz and Lee 2008]. In another example, the reversible butterfly opera- 
tion is a key element for Fast Fourier Transform (FFT) algorithms and has been used 
in application-specific Xtensa processors from Tensilica. Reversible computations in 
these applications are usually short and hand-optimized. 

• Program inversion and reversible debugging generalize the 'undo' feature in 
integrated debugging environments and allow reconstructing sequences of decisions 
that lead to a particular outcome. Automatic program inversion [Glück and Kawabe
2005] and reversible programming languages [Yokoyama et al. 2008; De Vos 2010b] 
allow reversible execution. Reversible debugging [Visan et al. 2009] supports reverse 
expression watch-pointing to provide further examination of a problematic event. 

• Networks on chip with mesh-based and hypercubic topologies [Dally and Towles 
2003] perform permutation routing among nodes when each node can both send 
and receive messages. To route a message, regular permutation patterns such as 
bit-reversal, complement and transpose are applied to minimize the number of com- 
munication steps. 

• Nano- and photonic circuits [Politi et al. 2009; Gao et al. 2010] are made up of 
devices without gain, and they cannot freely duplicate bits because that requires 
energy. They also tend to recycle available bits to conserve energy. Generally, building 
nano-size switching devices with gain is difficult because this requires an energy 
distribution network. Therefore, reversibility is fundamentally important to nano- 
scale computing, although specific constraints may vary for different technologies. 

• Quantum computation [Nielsen and Chuang 2000] is another motivation to study 
reversible computation because unitary transformations in quantum mechanics are 
reversible. Quantum algorithms have been designed to solve several problems in 
polynomial time [Bacon and van Dam 2010; Childs and van Dam 2010], where the
best-known conventional algorithms take more than polynomial time.^2 A key example is
number-factoring, which is relevant to cryptography. While unitary transformations 
can be difficult to work with in general, many prominent quantum algorithms con- 
tain large blocks with reversible circuits that do not invoke the full power of quan- 
tum computation, e.g., for arithmetic operations [Beckman et al. 1996; Van Meter 
and Itoh 2005; Takahashi and Kunihiro 2008]. Circuits for quantum error-correction 
contain large sections of reversible circuits that implement GF(2)-linear transforma- 
tions [Aaronson and Gottesman 2004]. 

In software and hardware applications of reversible information processing, se- 
quences of reversible operations can be viewed as reversible circuits. For example, 
swapping two values x and y with a sequence of three XOR or CNOT gates (shown
in Fig. 1a), i.e., the operations $x = x \oplus y$, $y = x \oplus y$, and $x = x \oplus y$, is illustrated
as a circuit in Fig. 1b. Such circuits are particularly useful in quantum computing.
Reversibility prohibits loops and explicit fanouts in circuits,^3 and each gate must have an equal
number of inputs and outputs with unique input-to-output assignments. Such peculiar
features of reversible circuits prevent the use of existing algorithms and tools for



^2 BQP (Bounded-Error Quantum Polynomial-Time) is the class of problems solvable by a quantum algorithm
in polynomial time with at most 1/3 probability of error. P is the class of problems solvable by a deterministic
Turing machine in polynomial time. Quantum computers have attracted attention as several BQP problems 
of practical interest are expected to be outside P. 

^3 Read-only fanouts do not conflict with this requirement as illustrated by line x in Fig. 1c, and arbitrary
fanouts can be simulated using ancilla lines, as we show in Fig. 3b. 
Fig. 1. Expressing CNOT and Toffoli gates using AND and XOR gates (a), swapping two values x and y
by three XOR operations (CNOT) as a reversible circuit (b), a reversible half-adder circuit (c), a sample 
reversible function (d). 



circuit synthesis and optimization. Reversible logic synthesis is the process of generat- 
ing a compact reversible circuit from a given specification. Research on reversible logic 
synthesis has attracted much attention after the discovery of powerful quantum algo- 
rithms in the mid 1990s [Nielsen and Chuang 2000]. Closely related techniques have 
also been motivated by other applications, e.g., the decomposition of permutations into 
tensor products is an important step in deriving fast algorithms and circuits for digital 
signal processing (Fourier and cosine transforms, etc.) [Egner et al. 1997]. 

This survey discusses methodologies, algorithms, benchmarks, tools, open problems 
and future trends related to the synthesis of combinational reversible circuits. The re- 
maining part is organized as follows. In Section 2 basic concepts are introduced. We 
outline the process of reversible synthesis in Section 3, including optimization and 
technology mapping. Algorithmic details are examined in Sections 4 and 5. Available 
benchmarks and tools for reversible logic are introduced in Section 6. Finally, we dis- 
cuss open challenges in reversible circuit synthesis in Section 7. 

2. BASIC CONCEPTS 

In this section, we introduce reversible logic gates and quantum gates, as well as re- 
versible and quantum circuits. Representations of reversible functions and cost models 
for reversible gates are also discussed. 

2.1. Reversible Gates and Circuits 

Let $A$ be a finite set and $f : A \to A$ a one-to-one and onto (bijective) function, i.e., a
permutation. For instance, the function $g = (1, 5, 3, 2, 0, 4, 6, 7)$ is a permutation over
$\{0, 1, \ldots, 7\}$ where $g(0) = 1$, $g(1) = 5$, $g(2) = 3$, etc. The set of all permutations on
$A = \{0, 1, \ldots, 2^n - 1\}$ forms the symmetric group^4 $S_{2^n}$ on $A$. A reversible Boolean function
is a multi-output Boolean function with as many outputs as inputs that is reversible.
Fig. 1d illustrates a reversible function on three variables that implements
the permutation $F = \{0, 1, 2, 7, 4, 5, 6, 3\}$.

Cycles. A cycle $(a_1, a_2, \ldots, a_k)$ is a permutation such that $f(a_1) = a_2$, $f(a_2) = a_3$, ...,
and $f(a_k) = a_1$. For example, $g$ can be written as $(0, 1, 5, 4)(2, 3)(6)(7)$. The length of a
cycle is the number of elements it contains. A cycle of length two is called a transposition.
A cycle of length $k$ is called a $k$-cycle. 1-cycles, e.g., (6) and (7) in $g$, are usually
omitted. Cycles $c_1$ and $c_2$ are disjoint if they have no common members. Any permuta-



^4 In abstract algebra, a group is a set with a binary operation on it, viewed as multiplication, which is
associative, $(a \cdot b) \cdot c = a \cdot (b \cdot c)$, has a neutral element $e$ such that $a \cdot e = e \cdot a = a$, and admits an inverse
for every element, $a \cdot a^{-1} = e$.
Fig. 2. Basic reversible gates. The Peres gate (reversible half-adder) is defined in Fig. 1c. The MAJ and UMA
gates together form a full-adder gate, used in [Cuccaro et al. 2005] to build reversible multi-bit adders. 



tion can be written as a product of disjoint cycles. This decomposition is unique up to 
the order of cycles. The composition of two disjoint cycles does not depend on the order 
in which the cycles are applied — disjoint cycles commute. In addition, a cycle may be 
written in different ways as a product of transpositions, e.g., $g = (0, 1)(0, 5)(0, 4)(2, 3)$
and $g = (4, 5)(0, 1)(1, 5)(4, 5)(0, 4)(2, 3)$. A cycle is even (odd) if it can be written as an
even (odd) number of transpositions, i.e., a $k$-cycle is odd (even) if $k$ is even (odd). The
same definition applies to even and odd permutations in general. 
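
The cycle notions above can be checked mechanically. The following is a minimal sketch, assuming a permutation is stored as a Python list `perm` with `perm[i]` the image of `i`, and that a product of cycles is applied from left to right (the convention under which the first transposition product given for $g$ holds). The helper names `disjoint_cycles`, `is_even`, and `apply_cycles` are illustrative and do not come from the survey.

```python
def disjoint_cycles(perm):
    """Return the disjoint cycles of perm, omitting 1-cycles."""
    seen = [False] * len(perm)
    cycles = []
    for start in range(len(perm)):
        cycle = []
        x = start
        while not seen[x]:
            seen[x] = True
            cycle.append(x)
            x = perm[x]
        if len(cycle) > 1:
            cycles.append(tuple(cycle))
    return cycles


def is_even(perm):
    """A k-cycle is a product of k - 1 transpositions; sum them up."""
    return sum(len(c) - 1 for c in disjoint_cycles(perm)) % 2 == 0


def apply_cycles(cycles, size):
    """Compose cycles (applied left to right) into one-line notation."""
    result = []
    for x in range(size):
        y = x
        for c in cycles:
            if y in c:
                y = c[(c.index(y) + 1) % len(c)]
        result.append(y)
    return result


g = [1, 5, 3, 2, 0, 4, 6, 7]
print(disjoint_cycles(g))  # [(0, 1, 5, 4), (2, 3)]
print(is_even(g))          # True: 3 + 1 = 4 transpositions
print(apply_cycles([(0, 1), (0, 5), (0, 4), (2, 3)], 8) == g)  # True
```

Here the 4-cycle and the 2-cycle of $g$ are both odd, so their product is even.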

Reversible gates. A reversible gate realizes a reversible function. For a gate $g$, the
gate $g^{-1}$ implements the inverse transformation. Common reversible gates are illustrated
in Fig. 2.

• A multiple-control Toffoli gate [Toffoli 1980] C$^m$NOT$(x_1, x_2, \ldots, x_{m+1})$ passes the first
$m$ lines, the control lines, unchanged. This gate flips the $(m+1)$-th line, the target line, if and
only if each positive (negative) control line carries the 1 (0) value. For $m = 0, 1, 2$ the
gates are named NOT (N), CNOT (C), and Toffoli (T), respectively. These three gates
compose the universal NCT library.

• A multiple-control Fredkin gate [Fredkin and Toffoli 1982] Fred$(x_1, x_2, \ldots, x_{m+2})$ has
two target lines $x_{m+1}, x_{m+2}$ and $m$ control lines $x_1, x_2, \ldots, x_m$. The gate interchanges
the values of the targets if the conjunction of all $m$ positive (negative) controls evaluates
to 1 (0). For $m = 0, 1$ the gates are called SWAP (S) and Fredkin (F), respectively.

• A Peres gate [Peres 1985] $P(x_1, x_2, x_3)$ has one control line $x_1$ and two target lines $x_2$
and $x_3$. It represents a C$^2$NOT$(x_1, x_2, x_3)$ and a CNOT$(x_1, x_2)$ in a cascade.

• An in-place majority (MAJ) gate computes the majority of three bits in place [Cuccaro 
et al. 2005], and provides the carry bit for addition. Cascading it with an Un-majority 
and Add (UMA) gate [Cuccaro et al. 2005] forms a full adder. 

In multiple-control Toffoli and Fredkin gates, each line is either control or target. 
The order of controls is immaterial and so is the order of targets, but interchanging 
controls with targets will create a different gate. A multiple-control Toffoli (or Fredkin) 
gate implements a single transposition if only incident bit-lines are considered. The 
transposition is determined by the controls of the gate. If an extended set of bit-lines 
is considered, these gates will implement sets of disjoint transpositions.^5 Multiple-control
Toffoli and Fredkin gates are self-inverse. For a Peres gate $P(x_1, x_2, x_3)$, the
inverse Peres is the CNOT$(x_1, x_2)$ C$^2$NOT$(x_1, x_2, x_3)$ pair. For MAJ and UMA gates, the
inverse gates can be constructed by reordering the CNOT and Toffoli gates. 
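
The gate definitions above translate directly into functions on bit tuples. This is a small sketch under a few assumptions: lines are indexed from 0, only positive controls are modeled, and the function names are illustrative rather than taken from any tool. It checks that Toffoli and Fredkin gates are self-inverse, that the Peres gate acts as a half-adder when its last line starts at 0, and that a Toffoli gate is a single transposition on its own lines but two disjoint transpositions once an extra line is considered.

```python
from itertools import product


def toffoli(bits, controls, target):
    """Multiple-control Toffoli: flip target iff all controls carry 1."""
    bits = list(bits)
    if all(bits[c] == 1 for c in controls):
        bits[target] ^= 1
    return tuple(bits)


def fredkin(bits, controls, t1, t2):
    """Multiple-control Fredkin: swap the two targets iff all controls carry 1."""
    bits = list(bits)
    if all(bits[c] == 1 for c in controls):
        bits[t1], bits[t2] = bits[t2], bits[t1]
    return tuple(bits)


def peres(bits, x1, x2, x3):
    """Toffoli(x1, x2; x3) followed by CNOT(x1; x2)."""
    return toffoli(toffoli(bits, [x1, x2], x3), [x1], x2)


def peres_inverse(bits, x1, x2, x3):
    """Same two gates in the opposite order."""
    return toffoli(toffoli(bits, [x1], x2), [x1, x2], x3)


def moved_states(gate, n):
    """Number of n-bit states that the gate does not leave fixed."""
    return sum(gate(s) != s for s in product([0, 1], repeat=n))


for s in product([0, 1], repeat=3):
    assert toffoli(toffoli(s, [0, 1], 2), [0, 1], 2) == s
    assert fredkin(fredkin(s, [0], 1, 2), [0], 1, 2) == s
    assert peres_inverse(peres(s, 0, 1, 2), 0, 1, 2) == s

# Half-adder behaviour of the Peres gate when the third line starts at 0.
for a, b in product([0, 1], repeat=2):
    assert peres((a, b, 0), 0, 1, 2) == (a, a ^ b, a & b)

print(moved_states(lambda s: toffoli(s, [0, 1], 2), 3))  # 2: one transposition
print(moved_states(lambda s: toffoli(s, [0, 1], 2), 4))  # 4: two disjoint transpositions
```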

Reversible circuits. A combinational reversible circuit is an acyclic combinational 
logic circuit in which all gates are reversible, and are interconnected without explicit 
fanouts and loops. In this survey, gates in a circuit diagram are processed from left 
to right. A reversible half-adder circuit in Fig. 1c implements the conventional half-



^5 If logic 0 and 1 are encoded as 01 and 10, respectively (dual-rail), SWAP performs inversion and the Fredkin
gate models the function of the CNOT gate. 
Fig. 3. The structure of inputs and outputs in a reversible circuit (a), explicit fanout in a reversible circuit
with a CNOT gate and one ancilla (b), a reversible circuit for computing $y = f(x_1, x_2, x_3)$ described by $C_f$
and garbage line $g$ (c), a 2-bit ripple-carry adder [Cuccaro et al. 2005] (d).



adder when $z = 0$.^6 For a set of gates $g_1, g_2, \ldots, g_k$ cascaded in a circuit $C$ in sequence,
the circuit $C^{-1} = g_k^{-1} g_{k-1}^{-1} \cdots g_1^{-1}$ (where $g_i^{-1}$ is the inverse of $g_i$) implements the inverse
transformation with respect to $C$. Different circuits computing the same function
are considered equivalent. For example, circuits $C_1 =$ SWAP$(x, y)$ and $C_2 =$ CNOT$(x, y)$
CNOT$(y, x)$ CNOT$(x, y)$ (Fig. 1b) are equivalent. For a library $L$, an $L$-circuit is composed
only of gates from $L$. A permutation is $L$-constructible if it is computable by
an $L$-circuit. When the library consists of a single gate (type), we use the gate name
instead of $L$. We call permutations implementable with only NOT, CNOT, or Toffoli
gates N-constructible, C-constructible, or T-constructible, respectively. $S_{2^n}$ has $2^n$
N-constructible, $\prod_{i=0}^{n-1}(2^n - 2^i)$ C-constructible, and $\frac{1}{2}(2^n - n - 1)!$ T-constructible
permutations [Shende et al. 2003]. Every even permutation is NCT-constructible [De Vos
et al. 2002; Shende et al. 2003]. When dealing with $n$ bits, reversible logic synthesis
searches for solutions in a space of $O(n 2^n)$ elements [Saeedi et al. 2010a]. A function
$f$ is affine-linear, or linear in short, if $f(x_1 \oplus x_2) = f(x_1) \oplus f(x_2)$ where $\oplus$ is a multi-bit
XOR operation. NC-constructible permutations are linear functions and vice versa
[Patel et al. 2008]. NCTSFP is the library consisting of NCT gates with SWAP, Fredkin
and Peres gates added.
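
Several statements in this paragraph can be verified by brute force for small $n$. The sketch below assumes a CNOT-only circuit is stored as a list of `(control, target)` pairs over 0-indexed lines and is applied from left to right; the helper names are illustrative. It confirms that three CNOTs implement SWAP, that reversing a CNOT circuit inverts it, and that a breadth-first search over CNOT circuits reaches exactly as many permutations as the C-constructible count quoted above.

```python
from itertools import product
from math import prod


def apply_cnot(bits, control, target):
    bits = list(bits)
    bits[target] ^= bits[control]
    return tuple(bits)


def truth_table(circuit, n):
    """Output row for every input row; gates are applied left to right."""
    table = []
    for bits in product([0, 1], repeat=n):
        for control, target in circuit:
            bits = apply_cnot(bits, control, target)
        table.append(bits)
    return tuple(table)


def inverse(circuit):
    """Each CNOT is self-inverse, so the inverse circuit is the reversed list."""
    return list(reversed(circuit))


def count_c_constructible(n):
    """Breadth-first search over all functions reachable with CNOT gates."""
    gates = [(c, t) for c in range(n) for t in range(n) if c != t]
    identity = truth_table([], n)
    seen = {identity}
    frontier = [identity]
    while frontier:
        next_frontier = []
        for table in frontier:
            for c, t in gates:
                new = tuple(apply_cnot(row, c, t) for row in table)
                if new not in seen:
                    seen.add(new)
                    next_frontier.append(new)
        frontier = next_frontier
    return len(seen)


swap = [(0, 1), (1, 0), (0, 1)]
print(truth_table(swap, 2))  # ((0, 0), (1, 0), (0, 1), (1, 1))

circuit = [(0, 1), (1, 2), (2, 0)]
assert truth_table(circuit + inverse(circuit), 3) == truth_table([], 3)

for n in (2, 3):
    assert count_c_constructible(n) == prod(2**n - 2**i for i in range(n))
print(count_c_constructible(3))  # 168
```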

Ancilla lines. There are $(2^n)!$ distinct reversible functions on $n$ variables, which
are permutations of $2^n$ elements. However, $\sum_{k=1}^{n} (2^k)^{2^n} \approx 2^{n 2^n}$ irreversible multiple-output
(from 1 to $n$) functions exist. To make the specification reversible, input/output
lines should be added. The added lines are called ancillae and typically start out with the
0 or 1 constant. An ancilla line whose value is not reset to a constant at the end of
the computation is called a garbage line. Unconstrained outputs of ancilla lines in
the truth table are called don't cares (DC). For an irreversible specification where
each output combination can be repeated up to $M$ times, $g = \lceil \log_2 M \rceil$ ancillae are
required to build a reversible circuit [Maslov and Dueck 2004]. For example, at least
two garbage lines ($f_2$ and $f_3$) and one constant line ($x$) are required to make the AND
gate reversible as shown in Fig. 1d. Every odd permutation can be implemented with
an NCT-circuit using one ancilla bit [Shende et al. 2003]. The Toffoli gate can be used
with one constant line to compute the NAND function, i.e., C$^2$NOT$(a, b, 1)$, making Toffoli
a universal gate in the Boolean domain. In general, the number of constant lines
plus primary inputs is equal to the number of garbage lines plus primary outputs. See
Fig. 3a for an illustration. A reversible copying gate or explicit fanout can be simulated
by a CNOT and one ancilla, which leaves no garbage bit at the output as illustrated in
Fig. 3b.
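
The bound $\lceil \log_2 M \rceil$ is easy to evaluate for small specifications. As a hedged illustration, the sketch below assumes the irreversible function is given as a Python callable on individual bits and counts how often each output pattern occurs; `min_garbage_lines`, `toffoli`, and `cnot` are names chosen here for illustration. It also checks the NAND and fanout constructions mentioned above.

```python
from collections import Counter
from itertools import product


def min_garbage_lines(f, n):
    """ceil(log2 M), where M is the largest number of inputs sharing one output."""
    counts = Counter(f(*bits) for bits in product([0, 1], repeat=n))
    m = max(counts.values())
    return (m - 1).bit_length()  # integer form of ceil(log2(m)) for m >= 1


def toffoli(a, b, c):
    return a, b, c ^ (a & b)


def cnot(a, b):
    return a, a ^ b


# AND maps three inputs to 0, so M = 3 and two extra outputs are needed.
print(min_garbage_lines(lambda a, b: a & b, 2))           # 2
# A half-adder (sum, carry) repeats the output (1, 0) twice, so M = 2.
print(min_garbage_lines(lambda a, b: (a ^ b, a & b), 2))  # 1

for a, b in product([0, 1], repeat=2):
    assert toffoli(a, b, 1)[2] == 1 - (a & b)  # NAND on the target line
for x in (0, 1):
    assert cnot(x, 0) == (x, x)                # fanout with a 0-initialized ancilla
```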

Reversible implementations. Toffoli [1980] proposed a generic NCT-circuit con- 
struction for an arbitrary reversible or irreversible function. For an implementation of 



^6 Similar to the conventional arithmetic circuits that are typically designed in terms of half- and full-adders,
identifying useful blocks such as half-adders is also common in reversible logic [Beckman et al. 1996; Van 
Meter and Itoh 2005; Cuccaro et al. 2005]. 
any irreversible function $f(x)$, its reversible implementation can be described in the
form $(x, y) \mapsto (x, y \oplus f(x))$. This specification is reversible since composing it with itself
produces $(x, y \oplus f(x) \oplus f(x)) = (x, y)$. Given a conventional circuit for $f$, a reversible circuit
can be constructed by making each gate reversible using a set of temporary lines
if necessary. To reuse these temporary lines, their values should be restored to
their initial values. To restore the values, first copy function outputs to a set of ancillae
with initial values 0 and then run the obtained reversible circuit in reverse to
recover the starting values [Bennett 1973] as illustrated in Fig. 3c. Fig. 3d shows a
2-bit ripple-carry adder with one ancilla [Cuccaro et al. 2005] where values of $a_0$ and
$a_1$ are recovered after computation. Note that if the values of temporary lines in a circuit
are not restored, this circuit cannot be inverted and is not convenient as a building
block for larger circuits.
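
The compute, copy, and uncompute steps can be sketched on a toy function. The example below is an assumption-laden illustration, not the construction of Fig. 3c: it picks $f(a, b, c) = ab \oplus c$, uses one temporary line `T` and one output line `Y` (both starting at 0), and encodes each gate as a tuple whose last entry is the target and whose earlier entries are positive controls.

```python
from itertools import product


def apply_gate(bits, gate):
    """Gate = (control, ..., control, target); flip target iff all controls are 1."""
    *controls, target = gate
    bits = list(bits)
    if all(bits[c] for c in controls):
        bits[target] ^= 1
    return tuple(bits)


def run(circuit, bits):
    for gate in circuit:
        bits = apply_gate(bits, gate)
    return bits


A, B, C, T, Y = range(5)             # three inputs, one temporary, one output
compute = [(A, B, T), (C, T)]        # T becomes (a AND b) XOR c
copy = [(T, Y)]                      # copy the result onto the output line
uncompute = list(reversed(compute))  # self-inverse gates in reverse order
bennett = compute + copy + uncompute

for a, b, c in product([0, 1], repeat=3):
    # The temporary line returns to 0 and Y holds f(a, b, c).
    assert run(bennett, (a, b, c, 0, 0)) == (a, b, c, 0, (a & b) ^ c)

# The whole circuit is its own inverse, as expected for (x, y) -> (x, y XOR f(x)).
for state in product([0, 1], repeat=5):
    assert run(bennett, run(bennett, state)) == state
```

Because the temporary line is restored, the block can be reused inside a larger circuit without accumulating garbage.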

Representation models. Reversible functions can be described in several ways, as 
illustrated in Fig. 4. 

• Truth tables. The simplest method to describe a reversible function of size $n$ is a truth
table with $n$ columns and $2^n$ rows.

• Matrix representations. A Boolean reversible function (permutation) $f$ can be described
by a 0-1 matrix with a single 1 in each column and in each row (a permutation
matrix), where the non-zero element in row $i$ appears in column $f(i)$. A different matrix
representation for linear functions [Patel et al. 2008] is described in Section 4.2.

• Reed-Muller expansion. To denote a specification with an algebraic formula, the positive-polarity
Reed-Muller (PPRM) expansion can be applied. PPRM expansion uses only uncomplemented
variables and can be derived from the EXOR-Sum-of-Products (ESOP)
description by replacing $a'$ with $a \oplus 1$ for a complemented variable $a$. The PPRM expansion
of a function is canonical^7 and is defined as follows.

$$f(x_1, x_2, \ldots, x_n) = a_0 \oplus a_1 x_1 \oplus \cdots \oplus a_n x_n \oplus a_{12} x_1 x_2 \oplus \cdots \oplus a_{n-1,n} x_{n-1} x_n \oplus \cdots \oplus a_{12 \ldots n} x_1 x_2 \cdots x_n$$

A compact way to represent PPRM expansions is the vector of coefficients $a_0, a_1, \ldots, a_{12 \ldots n}$,
called the RM spectrum of the function. Consider an $n$-variable function and
record its values (from the truth table) in a $2^n$-element bitvector $F$. Then, the RM
spectrum ($R$) of $F$ over the two-element field^8 GF(2) is defined as $R = M^n F$ where

$$M^0 = [1], \qquad M^n = \begin{bmatrix} M^{n-1} & 0 \\ M^{n-1} & M^{n-1} \end{bmatrix} \qquad (2)$$



• Cycle expansion. Viewing a reversible function as a permutation, one can represent 
it as a product of disjoint cycles. 

• Decision Diagrams. A reversible function can be represented by a Binary Decision Di- 
agram (BDD) [Bryant 1986; Hachtel and Somenzi 2000]. A BDD is a directed acyclic 
graph where the Shannon decomposition (i.e., $f = \bar{x}_i f_{x_i=0} + x_i f_{x_i=1}$) is applied on
each non-terminal node. Bryant [1986] proposed Reduced Ordered BDDs (ROBDDs), 
which offer canonical representations of Boolean functions. An ROBDD can be con- 
structed from a BDD by ordering variables, merging equivalent sub-graphs and re- 
moving nodes with identical children. Several more specialized BDD variants have 
emerged for reversible and quantum circuits [Viamontes et al. 2009]. In general, a 



^7 A canonical form is a way to rule out multiple representations of the same object. Given two different
representations, they can be converted to canonical forms. The objects are equivalent if and only if the 
canonical forms match. 

^8 The finite field GF(2) consists of elements 0 and 1 for which addition and multiplication are equivalent to
logical XOR and AND operations, respectively. 
Fig. 4. A sample quantum circuit that implements a reversible specification (a), the specification in various
formats: irreversible truth table (b), reversible truth table (c), matrix representation (d), cycle form (e), 
PPRM (f), RM spectrum (g), ROBDD (h). 



BDD of a function may need an exponential number of nodes. However, BDD variants 
can represent many practical functions with only polynomial numbers of nodes. 
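
Among the representations listed above, the RM spectrum lends itself to a short computation. The sketch below does not build the matrix explicitly; it applies an in-place butterfly over GF(2), which yields the PPRM coefficients of a truth vector. The indexing is an assumption made here: bit $i$ of a vector index stands for variable $x_{i+1}$, so index 3 refers to the monomial $x_1 x_2$. The function names are illustrative.

```python
def rm_spectrum(truth):
    """Map a truth vector F (length 2^n) to its PPRM coefficient vector over GF(2)."""
    r = list(truth)
    n = len(r).bit_length() - 1
    if len(r) != 1 << n:
        raise ValueError("truth vector length must be a power of two")
    for i in range(n):
        for idx in range(len(r)):
            if idx & (1 << i):
                # Add (XOR) the entry whose i-th variable is 0.
                r[idx] ^= r[idx ^ (1 << i)]
    return r


def evaluate_pprm(spectrum, x):
    """Evaluate the expansion at input index x: XOR of coefficients of monomials inside x."""
    return sum(spectrum[m] for m in range(len(spectrum)) if m & x == m) % 2


or_truth = [0, 1, 1, 1]                  # x1 OR x2
spectrum = rm_spectrum(or_truth)
print(spectrum)                          # [0, 1, 1, 1], i.e. x1 XOR x2 XOR x1*x2
assert [evaluate_pprm(spectrum, x) for x in range(4)] == or_truth
assert rm_spectrum(spectrum) == or_truth  # over GF(2) the transform is its own inverse
```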

Factorizations. A given cycle of length > 2 can be factorized into smaller cycles. 
For example, the 4-cycle (0, 1, 5, 4) can be factorized into three 2-cycles (0, 1)(0, 5)(0, 4). 
A factorization is of type $\alpha = (\alpha_2, \ldots, \alpha_k)$ if it results in exactly $\alpha_2$ 2-cycles, $\alpha_3$ 3-cycles
and so on. Define $\langle \alpha \rangle = \sum_{j \geq 2} (j - 1) \times \alpha_j$ for an $n$-bit permutation where $\alpha$ satisfies
$\langle \alpha \rangle \geq n - 1$. A factorization is minimal if $\langle \alpha \rangle = n - 1$. For instance, the factorization
$(0, 1, 5)(0, 4)(2, 3)$ of $g = (0, 1, 5, 4)(2, 3)$ is of type $\alpha = (2, 1)$ and is not minimal,
since $2 \times (2 - 1) + 1 \times (3 - 1) = 4 > 2$. Two factorizations are equivalent if one can be obtained
from the other by repeatedly exchanging adjacent factors that are disjoint. For exam- 
ple, factorizations (0, 1, 5)(0, 4)(2, 3) and (3, 2)(5, 0, 1)(0, 4) are equivalent. Since cycles 
(0, 1, 5) and (0, 4) share a common element, they do not commute. 
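
The example factorizations can be checked with a few lines of code. This is an illustrative sketch that assumes factors are applied from left to right and that the weight of a factorization is the sum of (length minus one) over its factors, matching the arithmetic in the example above; the helper names are hypothetical.

```python
from collections import Counter


def compose(cycles, size):
    """Apply the cycles left to right and return the permutation as a list."""
    result = []
    for x in range(size):
        y = x
        for c in cycles:
            if y in c:
                y = c[(c.index(y) + 1) % len(c)]
        result.append(y)
    return result


def factorization_type(cycles):
    """Return (number of 2-cycles, number of 3-cycles, ...) up to the longest factor."""
    counts = Counter(len(c) for c in cycles)
    return tuple(counts.get(j, 0) for j in range(2, max(counts) + 1))


def weight(cycles):
    """Sum over the factors of (cycle length - 1)."""
    return sum(len(c) - 1 for c in cycles)


g = [1, 5, 3, 2, 0, 4, 6, 7]
first = [(0, 1, 5), (0, 4), (2, 3)]
second = [(3, 2), (5, 0, 1), (0, 4)]

assert compose(first, 8) == g and compose(second, 8) == g
print(factorization_type(first))  # (2, 1)
print(weight(first))              # 4

# (0, 1, 5) and (0, 4) share the element 0, so swapping them changes the product.
assert compose([(0, 4), (0, 1, 5), (2, 3)], 8) != g
```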

2.2. Quantum Gates and Circuits 

A quantum bit, qubit, can be treated as a mathematical object that represents a quantum
state with two basic states $|0\rangle$ and $|1\rangle$. It can also carry a linear combination
$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$ of its basic states, called a superposition, where $\alpha$ and $\beta$ are complex
numbers and $|\alpha|^2 + |\beta|^2 = 1$. Although a qubit can carry any norm-preserving linear combination
of its basic states, when a qubit is measured, its state collapses into either $|0\rangle$
or $|1\rangle$ with probabilities $|\alpha|^2$ and $|\beta|^2$, respectively. A quantum register of size $n$ is an ordered
collection of $n$ qubits. Apart from the measurements that are commonly delayed
until the end of a quantum computation, all quantum computations are reversible.

Quantum gates. A matrix $U$ is unitary if $UU^\dagger = I$ where $U^\dagger$ is the conjugate transpose
of $U$ and $I$ is the identity matrix. An $n$-qubit quantum gate is a device which
performs a $2^n \times 2^n$ unitary operation $U$ on $n$ qubits in a specific period of time. For
a gate $g$ with a unitary matrix $U_g$, its inverse gate $g^{-1}$ implements the unitary matrix
$U_g^{-1}$. Among various quantum gates with different functionalities [Nielsen and
Chuang 2000] are Hadamard (H), phase shift ($R_\theta$), controlled-V, controlled-V$^\dagger$ and the
Pauli gates, which are defined in Fig. 5. For $\theta = \pi/2$ ($e^{i\theta} = i$) and $\theta = \pi/4$, the phase
shift gate is named the Phase (P) and $\pi/8$ (T) gates, respectively. NCV is the library of
NOT, CNOT, controlled-V and controlled-V$^\dagger$ gates. For an arbitrary single-qubit gate $U$, a
controlled-$U$ gate is a 2-qubit gate with one control and one target which applies $U$
Fig. 6. A three-qubit Quantum Fourier Transform (a), decomposing the Toffoli gate into one-qubit and six
CNOT gates; six CNOT gates are required [Shende and Markov 2009] (b). 



on the target qubit whenever the control condition is satisfied. Basic quantum gates 
are illustrated in Fig. 5. The set of reversible gates is a subset of all possible quantum
gates, distinguished by having only 0s and 1s as matrix elements. It would be
misleading to call reversible circuits quantum just because they are used in quantum 
information processing. As we show in Section 2, reversible circuits can be described 
and manipulated without leaving the Boolean domain. The size of reversible circuits 
can sometimes be reduced by introducing non-Boolean gates (Section 5). 
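
The unitarity condition and the relations among the named gates can be confirmed numerically. Since Fig. 5 is not reproduced in the text, the matrices below are the standard textbook definitions and should be read as assumptions: $H = \frac{1}{\sqrt{2}}\begin{bmatrix}1 & 1\\ 1 & -1\end{bmatrix}$, $V = \frac{1+i}{2}\begin{bmatrix}1 & -i\\ -i & 1\end{bmatrix}$, and $R_\theta = \mathrm{diag}(1, e^{i\theta})$. The helper names are illustrative, and plain nested lists are used to keep the sketch dependency-free.

```python
import cmath
from math import pi, sqrt


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def dagger(a):
    """Conjugate transpose."""
    return [[a[j][i].conjugate() for j in range(len(a))] for i in range(len(a[0]))]


def close(a, b, tol=1e-12):
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def phase_shift(theta):
    return [[1, 0], [0, cmath.exp(1j * theta)]]


I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]                                       # NOT gate
H = [[1 / sqrt(2), 1 / sqrt(2)], [1 / sqrt(2), -1 / sqrt(2)]]
V = [[(1 + 1j) / 2, (1 - 1j) / 2], [(1 - 1j) / 2, (1 + 1j) / 2]]
P = phase_shift(pi / 2)
T = phase_shift(pi / 4)

for gate in (X, H, V, P, T):
    assert close(matmul(gate, dagger(gate)), I2)           # U U^dagger = I

assert close(matmul(V, V), X)                              # V is a square root of NOT
assert close(matmul(V, dagger(V)), I2)                     # V^dagger undoes V
assert close(matmul(T, T), P)                              # two T gates make a Phase gate
assert close(P, [[1, 0], [0, 1j]])                         # e^{i pi/2} = i
```

Of these matrices only X has entries restricted to 0 and 1, which is what singles out reversible gates among quantum gates.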

Quantum circuits. A quantum circuit consists of quantum gates, interconnected 
by qubit carriers (i.e., wires) without feedback and explicit fanouts. Fig. 6a illustrates 
a 3-qubit quantum circuit for the Quantum Fourier Transform (QFT) which includes 
the Hadamard and phase shift gates. The inverse of a quantum circuit is constructed 
by inverting each gate and reversing their order. A set of gates is universal for quan- 
tum computation if any unitary operation can be approximated with arbitrary accu- 
racy by a quantum circuit which contains only those gates. The gate library consist- 
ing of CNOT and single-qubit gates is universal for quantum computation [Nielsen 
and Chuang 2000]. Fig. 6b shows a decomposition of the Toffoli gate into H, T, T$^\dagger$, and
CNOT gates; six CNOTs are required for Toffoli [Shende and Markov 2009]. The search
space for quantum-logic synthesis is not finite, and circuits implementing generic unitary
matrices require $\Omega(4^n)$ gates [Shende et al. 2004].

Stabilizer circuits. The gates Hadamard, Phase, and CNOT are called stabilizer 
gates. A stabilizer circuit is a quantum circuit consisting of stabilizer gates and mea- 
surement operations. Stabilizer circuits have applications in quantum error correc- 
tion, quantum dense coding, and quantum teleportation [Nielsen and Chuang 2000]. 
According to the Gottesman-Knill theorem [Nielsen and Chuang 2000], quantum cir- 
cuits exclusively consisting of the following components can be efficiently simulated on 
a classical computer in polynomial time: 

• A state preparation N-circuit with initial value $|00\ldots0\rangle$, i.e., qubit preparation in the
computational basis,

• Quantum gates from the Clifford group (Hadamard, Phase, CNOT, and Pauli gates), 

• Measurements in the computational basis.
Evaluation and simulation of quantum circuits. For matrices $A_{m \times n}$ and $B_{p \times q}$,
the tensor (Kronecker) product $A \otimes B$ is a matrix of size $mp \times nq$ in which each element
$a_{ij}$ of $A$ is replaced by the block $a_{ij} B$. The unitary matrix effected by several gates acting on disjoint
qubits (in parallel) can be calculated as the tensor (Kronecker) product of gate matrices.
For a set of $k$ gates $g_1, g_2, \ldots, g_k$ with matrices $U_1, U_2, \ldots, U_k$ cascaded in a quantum
circuit $C$ (sequentially), the matrix of $C$ can be calculated as $U_k U_{k-1} \cdots U_1$. Straightforward
simulation of quantum circuits by matrix multiplication requires $\Omega(2^n)$ time
and space [Viamontes et al. 2009]. To improve runtime and memory usage, algorithmic
techniques have been developed for high-performance simulation of quantum circuits
[Shi et al. 2006; Viamontes et al. 2009].^9
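
The two composition rules, Kronecker product for parallel gates and matrix product in reverse order for cascaded gates, can be illustrated on a two-qubit circuit. The sketch below is a naive, exponential-size simulation and not one of the high-performance techniques cited above. It assumes the first qubit is the most significant bit of the basis-state index and that the CNOT is controlled by the first qubit; all names are illustrative.

```python
from math import sqrt


def kron(a, b):
    """Kronecker product: every entry a_ij is replaced by the block a_ij * B."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def apply(u, state):
    return [sum(u[i][j] * state[j] for j in range(len(state))) for i in range(len(u))]


I2 = [[1, 0], [0, 1]]
H = [[1 / sqrt(2), 1 / sqrt(2)], [1 / sqrt(2), -1 / sqrt(2)]]
CNOT = [[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 0]]

U1 = kron(H, I2)          # H on the first qubit in parallel with identity on the second
U = matmul(CNOT, U1)      # cascade: the first gate applied is the rightmost factor

state = apply(U, [1, 0, 0, 0])   # start from |00>
print(state)                      # amplitudes 1/sqrt(2) on |00> and |11>, 0 elsewhere

# Dimensions follow the mp x nq rule: a 2x3 matrix times a 2x2 matrix gives 4x6.
shape = kron([[1, 2, 3], [4, 5, 6]], I2)
assert (len(shape), len(shape[0])) == (4, 6)
```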

Quantum circuit technologies. To physically implement qubits, differ- 
ent quantum-mechanical systems have been proposed, each with particular 
strengths and weaknesses, as discussed in the Quantum Computation Roadmap 
(http://qist.lanl.gov/qcomp_map.shtml). Leading candidate technologies represent the 
state of a qubit using 

• A two-level motion mode of a trapped ion or atom, 

• Nuclear spin polarizations in nuclear magnetic resonance (NMR), 

• Single electrons contained in Gallium arsenide (GaAs) quantum dots, 

• The excitation states of Josephson junctions in superconducting circuits, 

• The horizontal and vertical polarization states of single photons. 

Quantum gates are effected by shining laser pulses on neighboring ions or atoms, ap- 
plying electromagnetic pulses to spins in a strong magnetic field, changing voltages 
and/or current in a superconducting circuit, or passing photons through optical media. 
These and other technologies are discussed in textbooks [Nielsen and Chuang 2000] 
and research publications [Politi et al. 2009; Gao et al. 2010]. 

Interpreting quantum circuit diagrams. Representing quantum circuits with 
circuit diagrams invites analogies with conventional CMOS circuits, but there are sev- 
eral fundamental differences. 

(1) In a quantum circuit, qubits typically exist as fixed physical entities (e.g., electrons, 
photons or nuclei), and quantum gates operate on a qubit register (some gates can 
be invoked in parallel). This is in contrast to conventional semiconductor circuits 
where signals travel through gates, often fanning out and reconverging. 

(2) Wirelines in a quantum circuit are used to trace the different states of a qubit 
during computation. Unlike in conventional circuits, wirelines in quantum and re- 
versible circuits have sequential semantics. This can be illustrated by considering 
constant-propagation, i.e., simplifying a circuit when some of the inputs are given 
known values. Even when the input values are 0 or 1, wirelines in circuits like the
reversible adder from [Cuccaro et al. 2005] (Fig. 3d) cannot be removed because 
they are also used to store intermediate values of computation. 

(3) In many implementations, quantum gates are invoked by electromagnetic pulses, 
in which case the different gates of a combinational circuit appear for short pe- 
riods of time and then disappear. This is in contrast to more familiar circuits in 
semiconductor chips, where independently existing gates are connected by metal- 
lic interconnect. Photonic quantum circuits use explicit interconnect in the form of 
photonic waveguides. 



^9 PP (Probabilistic Polynomial-Time) is the class of decision problems solvable by an NP (Nondeterministic
Polynomial-Time) machine which gives the correct answer (i.e., 'Yes' or 'No') with probability > 1/2. $P^{PP}$ (P
with PP oracle) includes decision problems solvable in polynomial time with the help of an oracle for solving
problems from PP. Quantum circuit simulation belongs to the complexity class $P^{PP}$.
(4) Conventional circuits are typically synchronized through sequential elements
(latches and flip-flops) because the timing of individual gates cannot be controlled 
accurately. In quantum circuits where each gate can be invoked at a precisely spec- 
ified moment in time, there is no need for synchronization using sequential gates, 
and the entire computation can be scheduled by timing each combinational gate. 

(5) In conventional circuits, each wire is assumed to carry a 0 or 1 signal, and each output
of a combinational circuit is deterministically observable at the end of a clock
cycle. However, these assumptions break down in a quantum circuit that generates
non-Boolean values [Nielsen and Chuang 2000] because (i) multiple qubits can be
entangled, (ii) to directly observe a qubit, it must be measured, which generates a
nondeterministic outcome and affects other entangled qubits.

These differences between quantum and conventional circuits are sometimes misun- 
derstood in the literature, as we point out in Section 7. 

2.3. Circuit Cost Models 

Current quantum technologies suffer from intrinsic limitations which prohibit some
circuits and favor others. Prime examples are the small number of available qubits and
the requirement that gates act only on geometrically adjacent qubits (in a particular
layout). To be relevant in practice, circuit synthesis algorithms must be able to sat- 
isfy technology-specific constraints and improve technology-specific cost metrics. For 
example, currently popular trapped-ions [Haffner et al. 2005] and liquid-state NMR 
[Negrevergne et al. 2006] technologies allow computation on sets of 8-12 qubits in a 
linear nearest neighbor (LNN) architecture where only adjacent qubits can interact. 
Furthermore, a physical qubit can hold its state only for a limited time, called coher- 
ence time, which varies among different technologies from a few nanoseconds to several 
seconds [Van Meter and Oskin 2006]. Because of decoherence, qubits are fragile and 
may spontaneously change their joint states. 

Just as in conventional circuits, the trivial gate count metric does not adequately 
reflect the resources required by different gates. Similar to transistor counts, used 
to compare logic gates implemented in CMOS chips, one can define the technology- 
specific cost of quantum gates by decomposing them into elementary blocks supported 
by a particular technology. A physical implementation of an elementary operation
depends on the Hamiltonian of a given quantum system [Zhang et al. 2003]. For example,
in a one-dimensional exchange, i.e., Ising Hamiltonian characterized by interaction
in the z direction only, the 2-qubit SWAP gate requires three qubit interactions. In a
two-dimensional exchange with the XY Hamiltonian, it can be implemented by a sin- 
gle two-qubit interaction. In an ion trap system, "elementary gates" are implemented 
with carefully tuned RF pulse sequences. Gate costs can be affected not only by di- 
rect resource requirements (size, runtime, available frequency channels) but also by 
considerations of circuit reliability in the context of frequent transient errors (e.g., 
decoherence of quantum bits). Some gates may be more amenable to error-correction 
than others, e.g., the CNOT gate and other linear transformations allow for convenient 
fault-tolerant extensions. In order to abstract away specific technology details, several 
abstract cost functions have been proposed in the literature. However, their relevance 
strongly depends on future developments in quantum-circuit technologies. 

Footnote: A Hamiltonian describes time-dependent behavior of a quantum system and can be compared
to a set of forces acting on a non-quantum system.

• Speed was defined in [Beckman et al. 1996] to approximate the runtime of a quantum
computation on an ion trap-based quantum technology, assuming all laser
pulses take equal amounts of time. They observed that the C^kNOT gate (k = 1, 2,
etc.) can be implemented by 2k + 3 laser pulses. The authors assumed that only one
gate can be applied at a time.

• Number of one-qubit gates and CNOT (or any other two-qubit gate) is a complexity 
metric for quantum synthesis algorithms. Since CNOT is a linear gate, the number 
of one-qubit gates (excluding inverters) needed to express a computation is defined 
as a measure of non-linearity for a given computation [Shende and Markov 2009]. 

• Quantum cost (QC) is defined as the number of NOT, CNOT, controlled-V and
controlled-V† gates required for implementing a given reversible function. These
gates can be efficiently implemented in an NMR-based quantum technology by a
sequence of electromagnetic pulses [Lee et al. 2006]. Under any other quantum
technology, primitive gates can be adapted similarly. For example, while Toffoli needs
five gates from the NCV library (two CNOT, two controlled-V, and one controlled-V†
gates), it needs exactly six CNOTs and several one-qubit gates under the universal
set of one-qubit and CNOT gates (Fig. 6b) [Shende and Markov 2009]. In another
example, the Fredkin gate is easier to implement than the Toffoli gate under
some quantum technologies [Fei et al. 2002]. A single-number cost model, based
on the number of two-qubit operations required to simulate a given gate, was used
in [Maslov and Saeedi 2011] where the costs of both n-qubit Toffoli and n-qubit
Fredkin gates (n > 3) are estimated as 10n - 25. The QC of a circuit is calculated by
summing the QCs of its gates.

• Interaction cost is the distance between gate qubits for any 2-qubit gate. Quantum 
circuit technologies with 1D, 2D and 3D interactions exist [Cheung et al. 2007]. 
Interaction cost for a circuit is calculated by a summation over the interaction costs 
of its gates. 

• Number of ancillae and garbage bits (ancillae not reset to 0) reflects the limited 
number of qubits available in contemporary quantum computers. 

• Depth (or the number of levels) is defined as the largest number of elementary gates 
on any path from inputs to outputs in a circuit. When any subset of gates can be 
invoked simultaneously, decreasing circuit depth reduces circuit latency. This as- 
sumption is trivial for conventional semiconductor circuits because the gates are 
manufactured individually and exist at the same time. However, when quantum 
gates are invoked by electromagnetic pulses, their parallel invocation must clear 
a number of obstacles — it should be possible to select just the right set of qubits 
on which the gates are applied, which may require several laser sources and possi- 
bly several pre-determined wavelengths. When the parallel gates perform different 
functions, interference between them may limit achievable parallelism. Practical 
quantum computers can either apply the same gate to all qubits or apply different 
gates to a small number of qubits. 

As pointed out in [Beckman et al. 1996], specific quantum-circuit technologies may 
entail more involved cost functions where the delay of a gate may depend on neigh- 
boring gates. The abstract cost functions introduced above do not capture such effects. 
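
The abstract metrics listed above can be computed mechanically once a circuit is written as a gate list. The following is a minimal sketch under stated assumptions, not a cost table taken from the cited works: the `Gate` representation and all function names are hypothetical, the per-gate cost combines the NCV figures quoted above (NOT and CNOT count as 1, Toffoli as 5) with the 10n - 25 estimate of [Maslov and Saeedi 2011] for gates on n > 3 qubits, interaction cost is taken as the line distance |a - b| on a 1D layout, and depth treats every listed gate as one level and assumes that gates on disjoint lines can run in parallel.

```python
from typing import List, NamedTuple, Tuple


class Gate(NamedTuple):
    """Hypothetical gate record: a multiple-control Toffoli gate."""
    controls: Tuple[int, ...]  # control line indices (empty for NOT)
    target: int                # target line index


def gate_quantum_cost(gate: Gate) -> int:
    """Assumed cost table: NOT/CNOT = 1, Toffoli = 5, larger gates = 10n - 25."""
    size = len(gate.controls) + 1  # number of qubits the gate touches
    if size <= 2:
        return 1
    if size == 3:
        return 5
    return 10 * size - 25


def quantum_cost(circuit: List[Gate]) -> int:
    """QC of a circuit is the sum of the QCs of its gates."""
    return sum(gate_quantum_cost(g) for g in circuit)


def interaction_cost(circuit: List[Gate]) -> int:
    """Sum of line distances over all 2-qubit gates (1D layout assumed)."""
    return sum(abs(g.controls[0] - g.target)
               for g in circuit if len(g.controls) == 1)


def depth(circuit: List[Gate], num_lines: int) -> int:
    """Number of levels when gates on disjoint lines share a level."""
    level = [0] * num_lines  # last occupied level on each line
    for g in circuit:
        lines = g.controls + (g.target,)
        slot = 1 + max(level[q] for q in lines)
        for q in lines:
            level[q] = slot
    return max(level, default=0)


example = [
    Gate((0, 1), 2),     # Toffoli
    Gate((0,), 1),       # CNOT between adjacent lines
    Gate((), 3),         # NOT, can share a level with an earlier gate
    Gate((0, 1, 2), 3),  # 4-qubit Toffoli
]
print(quantum_cost(example), interaction_cost(example), depth(example, 4))
```

Any technology-specific model would replace `gate_quantum_cost` with its own table; the summation structure stays the same.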

3. GENERATION AND OPTIMIZATION OF REVERSIBLE CIRCUITS 

In this section we outline key steps in generation and optimization of reversible 
circuits, as illustrated in Fig. 7. Algorithmic details will be given in Sections 4 and 5. 
To implement an irreversible specification using reversible gates, ancillae should be
added to the original specification where the number of added lines, their values, and
the ordering of output lines affect the cost of synthesized circuits. This process can be
either performed prior to synthesis or in a unified approach during synthesis.

Footnote: Fredkin can be constructed using three Toffoli gates by adding one control to each CNOT
gate in Fig. 1b. The three Toffolis can then be simplified into two CNOTs around a Toffoli.

Fig. 7. A general flow used in recent reversible logic synthesis methods.

Synthesis seeks reversible circuits that satisfy a reversible specification. It can be 
performed optimally or heuristically. 

• Optimal iterative-deepening A*-search (IDA*) was used in [Shende et al.
2003] to find optimal circuits of all 3-input reversible functions. Golubitsky et al.
[2010] observed that an optimal realization of some reversible functions can be
constructed from an optimal circuit of another function, so there is no need to
synthesize all functions independently. For example, optimal circuits for f^{-1} can be
constructed by reversing optimal circuits for f. By exploiting such symmetries and using
a hashing technique, the authors found optimal circuits for all 4-input permutations. Symbolic
reachability analysis [Hung et al. 2006] and Boolean satisfiability (SAT) [Grosse 
et al. 2009] have been applied to find optimal realizations for reversible functions. 
These methods mainly formulate the synthesis problem as a sequence of instances 
of standard decision problems, such as Boolean satisfiability, and use third-party 
software to solve these problem instances. Only a small number of qubits and gates 
can be handled by these methods. 

• Asymptotically optimal synthesis was proposed by Patel et al. [2008] for linear
reversible circuits, which leads to Θ(n^2 / log n) CNOT gates in the worst case. Maslov
[2007] addressed depth-optimal synthesis of stabilizer circuits and proposed a
synthesis algorithm that constructs circuits by concatenating 90n + O(1) stages, each
stage containing only one type of gates (CNOTs or certain one-qubit gates).
Asymptotically optimal methods may not produce optimal circuits for specific inputs.

Since most circuits of practical interest are non-linear and too large for optimal syn- 
thesis, heuristic algorithms were proposed. The choice of a representation model for re- 
versible functions plays a significant role in developing effective synthesis algorithms. 
Each model favors certain types of reversible functions by representing them concisely. 
Synthesis algorithms are developed by detecting such simple cases and decomposing 
reversible functions into sequences of simpler functions in a given model. 

• Transformation-based methods [Miller et al. 2003; Maslov et al. 2007] iteratively 
select a gate so as to make a function's truth table or RM spectrum more similar to 
the identity function. These methods are mainly efficient for permutations where 
output codewords follow a regular (repeating) pattern. 

• Search-based methods [Gupta et al. 2006; Donald and Jha 2008] traverse a search 
tree to find a reasonably good circuit. These methods mainly use the PPRM ex- 
pansion to represent a reversible function. The efficiency of these methods is highly 
dependent on the number of circuit lines and the number of gates in the final circuit. 

• Cycle-based methods [Shende et al. 2003; Saeedi et al. 2010a] decompose a given
permutation into a set of disjoint (often small) cycles and synthesize individual
cycles separately. Compared to other algorithms, these methods are mainly efficient
for permutations without regular patterns and reversible functions that leave many
input combinations unchanged.

• BDD-based methods [Wille and Drechsler 2009; Wille et al. 2010b] use binary deci- 
sion diagrams to improve sharing between controls of reversible gates. These tech- 
niques scale better than others. However, they require a large number of ancilla 
qubits — a valuable resource in fledgling quantum computers. 

Several other heuristics do not directly use the discussed representation models. 
Some reuse algorithms developed for conventional logic synthesis, e.g., the algorithm 
proposed in [Mishchenko and Perkowski 2002] uses ancillae to convert an optimized 
irreversible circuit into a reversible circuit. In [Fazel et al. 2007] a circuit is constructed 
as a cascade of ESOP gates in the presence of some ancillae. Another approach uses 
abstract group theory to synthesize reversible circuits [Storme et al. 1999; Rentergem 
et al. 2007; Yang et al. 2006]. However, as of 2011, empirical performance of reported 
implementations lags behind that of more established approaches. Heuristic synthesis 
is discussed in Section 4.3, while synthesis of optimal circuits is explored in Sections 
4.1 and 4.2. 

Post-synthesis optimization. The results obtained by heuristic synthesis methods 
are often sub-optimal. Further improvements can be achieved by local optimization. 

• Improving gate count and quantum cost. To improve the quantum cost of a circuit, 
several techniques attempt to improve individual sub-circuits one at a time. Sub- 
circuit optimization may be performed based on offline synthesis of a set of functions 
using pre-computed tables [Prasad et al. 2006; Maslov et al. 2008a], online synthesis 
of candidates [Maslov et al. 2007; Arabzadeh et al. 2010], or circuit transformations 
that involve additional ancillae [Miller et al. 2010; Maslov and Saeedi 2011]. 

• Reducing circuit depth. To realize a low-depth implementation of a given function, 
consecutive elementary gates with disjoint sets of control and target lines should be 
used to provide the possibility of parallel gate execution. Circuit depth may also be 
improved by restructuring controls and targets of different gates in a synthesized 
circuit [Maslov et al. 2008a]. 

• Improving locality. For the implementation of a given computation on a quantum 
architecture with restricted qubit interactions, one may use SWAP gates to move 
gate qubits towards each other as much as required. The interaction cost of a given 
computation can be hand-optimized for particular applications [Fowler et al. 2004a; 
Kutin et al. 2007; Takahashi et al. 2007]. A generic approach can also be used to 
either reduce the number of SWAP gates [Saeedi et al. 2011b] or find the minimal 
number of SWAP gates [Hirata et al. 2011] for a circuit. 

Incremental optimization can significantly improve synthesis results, but it cannot 
guarantee optimality. To illustrate this, consider the NCT-optimal circuit in Fig. 8a 
[Prasad et al. 2006]. Suppose the pattern is continued by adding one gate at a time un- 
til the circuit becomes suboptimal for the function it computes. In the resulting circuit, 
no suboptimal sub-circuits are formed, and hence no local-optimization method can 
find a reduction that is available. Section 5 offers additional details on post-synthesis 
optimization. 
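
As a concrete instance of local optimization, the sketch below applies the simplest possible rewrite rule: two identical adjacent multiple-control Toffoli gates cancel, because each such gate is its own inverse. This is an illustrative peephole pass only, not one of the cited optimizers; the `(target, control_mask)` gate encoding and the function names are assumptions made for this example.

```python
def cancel_adjacent_duplicates(circuit):
    """Remove pairs of identical neighbouring gates (each gate is self-inverse).

    `circuit` is a list of (target, control_mask) tuples. A stack is used so
    that cancellations exposed by earlier cancellations are also found.
    """
    reduced = []
    for gate in circuit:
        if reduced and reduced[-1] == gate:
            reduced.pop()          # gate followed by itself is the identity
        else:
            reduced.append(gate)
    return reduced


def simulate(circuit, x):
    """Apply the gates in order to the input word x (bit i of x = line i)."""
    for target, controls in circuit:
        if (x & controls) == controls:
            x ^= 1 << target
    return x


circuit = [(2, 0b011), (0, 0b010), (0, 0b010), (2, 0b011), (1, 0b001)]
print(cancel_adjacent_duplicates(circuit))
```

Rules of this kind only inspect small windows of a circuit, which is why, as the example of Fig. 8a indicates, they cannot guarantee an optimal result.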

Fig. 8. (a) An optimal circuit used to illustrate limitations of local optimization [Prasad et al. 2006].
If the pattern is continued by adding one gate at a time until the circuit first becomes suboptimal,
no local change could make the resulting circuit optimal. (b) Decomposition of a C^{2m+1}NOT gate
[Asano and Ishii 2005], m = 2^k, k ≥ 1. In this figure, k = 1.

Technology mapping. To physically implement a circuit using a given technology,
all gates should be mapped (decomposed) into gates directly available in this technology.
Such technology mapping can be applied either before post-synthesis optimization
or after. Barenco et al. [1995] showed that a multiple-control Toffoli gate in a circuit on
n qubits can be mapped into a set of Toffoli gates, with different circuit sizes, depending
on how many ancillae are available.

(1) Without ancilla, n ≥ 3: A C^{n-1}NOT gate can be simulated by 2^{n-1} - 1 controlled-V
and controlled-V† gates and 2^{n-1} - 2 CNOTs.

(2) With one ancilla, n ≥ 7: A C^{n-2}NOT gate can be simulated by 8(n - 5) Toffoli gates.

(3) With m - 2 ancillae, m ∈ {3, 4, ..., ⌈n/2⌉}, n ≥ 5: A C^mNOT gate can be simulated
by 4(m - 2) Toffoli gates.
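
The three counts attributed to Barenco et al. [1995] are simple closed forms, and a quick numerical check helps connect them to the cost figures quoted earlier. The sketch below evaluates 2^{n-1} - 1 and 2^{n-1} - 2 (no ancilla), 8(n - 5) (one ancilla) and 4(m - 2) (m - 2 ancillae); the function names and the range checks are assumptions made for this illustration. For n = 3 the first case gives 3 controlled-V/V† gates and 2 CNOTs, i.e., the five NCV gates of a Toffoli gate.

```python
from math import ceil


def no_ancilla_counts(n):
    """C^(n-1)NOT on n qubits, no ancilla: (controlled-V/V† gates, CNOTs)."""
    if n < 3:
        raise ValueError("requires n >= 3")
    return 2 ** (n - 1) - 1, 2 ** (n - 1) - 2


def one_ancilla_toffolis(n):
    """C^(n-2)NOT on n qubits with one ancilla: number of Toffoli gates."""
    if n < 7:
        raise ValueError("requires n >= 7")
    return 8 * (n - 5)


def many_ancilla_toffolis(m, n):
    """C^mNOT on n qubits with m - 2 ancillae: number of Toffoli gates."""
    if n < 5 or not 3 <= m <= ceil(n / 2):
        raise ValueError("requires n >= 5 and 3 <= m <= ceil(n / 2)")
    return 4 * (m - 2)


v_gates, cnots = no_ancilla_counts(3)
print(v_gates + cnots)              # NCV gates of a Toffoli gate
print(one_ancilla_toffolis(8))      # C^6NOT with one ancilla
print(many_ancilla_toffolis(5, 9))  # C^5NOT with three ancillae
```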

Maslov and Dueck [2003] converted Toffoli gates into (inverse) Peres gates which leads
to 32m - 96 and 16m - 32 elementary gates from the NCV library for the cases (2) and (3),
respectively. Asano and Ishii [2005] presented a quantum circuit, illustrated in Fig. 8b,
to simulate a C^{2m+1}NOT gate on 4m + 1 qubits, m = 2^k, k ≥ 1, that contains 4m units
of Toffoli gates. Each unit performs m Toffoli operations simultaneously on 3m qubits.
By eliminating individual-qubit manipulation, their circuit increases parallelism in
quantum circuits at the cost of additional gates.

Maslov et al. [2008a] improved the result of [Maslov and Dueck 2003] by removing
redundant controlled-V gates, which leads to 12m - 22 and 24n - 88 gates for (2) and
(3), correspondingly. Fig. 9 illustrates the decomposition of a multiple-control Toffoli
gate where b-c, d, and e are the results of applying the methods of [Barenco et al. 1995],
[Maslov and Dueck 2003], and [Maslov et al. 2008a], respectively. Miller and Sasanian
[2010] proposed techniques to reduce the number of elementary gates for C^cNOT, c ∈ {3, 15},
assuming {1, 2, c - 2} ancillae. Synthesis and post-synthesis optimization methods
which consider the underlying gate libraries, e.g., to improve locality or to decrease
circuit depth, should also benefit from an internal technology mapping.

4. ALGORITHMS FOR REVERSIBLE CIRCUIT SYNTHESIS 

In the following subsections, we discuss exact and asymptotically optimal synthesis 
methods followed by heuristic algorithms. 

4.1. Optimal Methods 

For a reversible circuit with n lines, where its optimal realization needs h gates from
a library L, an enumerative method branches over every gate of L at each of the h gate
positions. For example, assume that only multiple-control Toffoli gates exist in the
library. For this simplified case, an exhaustive method examines (n × 2^{n-1})^h gate
sequences to find an optimal circuit. For n = 3, the worst-case circuit needs eight gates
from the NCT library. Therefore, only 12^8 different cases should be examined. For n = 4,
optimal circuits with 15 gates exist [Golubitsky et al. 2010], hence 2^75 ≈ 3.8 × 10^22
different cases should be analyzed by exhaustive search to ensure that a min-cost circuit
is found.

Footnote: There are C(n,1) possible NOT gates and C(n,2) possible CNOT gates in which one of its
two inputs can be the target output. Hence, the total number of 2 × C(n,2) CNOT gates can be
obtained. For a (k+1)-bit gate, k ∈ {2, 3, ..., n-1}, there are C(n-1,k) possible gates when the
target can be the i-th (i ∈ [1, n]) bit. Considering all possible bits as the target leads to
n × C(n-1,k) (k+1)-bit gates. Therefore, the total number of gates is
C(n,1) + 2 × C(n,2) + n × Σ_{k ∈ {2, ..., n-1}} C(n-1,k) = n × 2^{n-1}.

Fig. 9. Decomposition of a multiple-control Toffoli gate (a) into smaller multiple-control Toffoli
gates [Barenco et al. 1995] (b) and Toffoli gates [Barenco et al. 1995] (c), Peres gates [Maslov and
Dueck 2003] (d), elementary gates [Maslov et al. 2008a] (e), in the presence of one garbage line.

3-qubit circuits. Shende et al. performed gate-count optimal synthesis of 3-bit
reversible functions by gradually building up a library of optimal circuits for all 8!
permutations, rather than by dealing with each permutation individually [Shende et al.
2003]. Noting that every sub-circuit of an optimal circuit is also optimal, they stored
optimal circuits with m gates and added one gate at the end of each stored circuit in
all possible ways. Those resulting circuits that implement new functions can be added
to the library. To lower memory usage when synthesizing a given permutation, instead
of examining all optimal circuits with k gates in the library for increasing values of
k, the algorithm in [Shende et al. 2003] stops at m (m < k) gates and seeks circuits
with m + 1 gates that implement the permutation. In the absence of solutions, it seeks
circuits with m + 2 gates and so on.

4-qubit circuits. Optimal synthesis of 4-bit reversible functions was investigated in
[Prasad et al. 2006], [Yang et al. 2008], and [Golubitsky et al. 2010]. Initially Prasad
et al. [2006] introduced a data structure to represent all 40,320 optimal 3-input and
about 26,000,000 optimal 4-input reversible circuits with up to six gates from the NCT
library. Yang et al. [2008] improved this method where the implementation of a
specification on four variables was explored in a search tree based on a bidirectional
approach [Miller et al. 2003]. Consequently, over 50% of even 4-bit reversible circuits
(approximately one quarter of all possible ones) were optimally realized with up to 12
NOT, CNOT and Peres gates. Golubitsky et al. [2010] offered further improvements.
They noted that in an optimal circuit with k gates, the first ⌈k/2⌉ gates and the last
⌊k/2⌋ gates must also form optimal circuits for respective functions. Hence, they first
synthesized all half-sized optimal circuits and stored them in a hash table. The hash
table was searched next for finding both halves of any optimal circuit with four
inputs. Additionally, a simultaneous input/output relabeling (reordering) was applied,
and symmetries of reversible functions were used to further reduce the search space.
Optimal realization for the inverse f^{-1} of a function f was obtained by reversing an
optimal circuit of f. The last two techniques reduce the search space by more than a
factor of 48 (i.e., 2 × 4!). Running for less than 3 hours on a high-performance server
with 16 AMD 2300 MHz processors and 64 GB RAM, Golubitsky et al. [2010] found the
distributions of gate-count optimal 4-bit circuits up to 15 gates reproduced in Table I.

Table I. The distribution of gate counts in gate-count optimal circuits for all
3- and 4-qubit functions with respect to the NCT library.

Number of   3-bit Functions        4-bit Functions
gates       [Shende et al. 2003]   [Golubitsky and Maslov 2011]
15          0                      144
14          0                      37,481,795,636
13          0                      4,959,760,623,552
12          0                      10,690,104,057,901
11          0                      4,298,462,792,398
10          0                      819,182,578,179
9           0                      105,984,823,653
8           577                    10,804,681,959
7           10,253                 932,651,938
6           17,049                 70,763,560
5           8,921                  4,807,552
4           2,780                  294,507
3           625                    16,204
2           102                    784
1           12                     32
0           1                      1
Total       8! = 40,320            16! = 20,922,789,888,000
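
The library-building idea for 3-bit functions is small enough to run directly. The sketch below is an illustrative breadth-first enumeration, not the implementation of [Shende et al. 2003]: starting from the identity, it appends one gate at a time in all possible ways and records the first (hence minimal) gate count at which each permutation appears. The function names and the `(target, control_mask)` encoding are assumptions made for this example. For n = 3 the gate list has n × 2^{n-1} = 12 entries, and the printed histogram should agree with the 3-bit column of Table I.

```python
from collections import Counter, deque
from itertools import combinations


def toffoli_gates(n):
    """All multiple-control Toffoli gates on n lines as (target, control_mask)."""
    gates = []
    for target in range(n):
        others = [line for line in range(n) if line != target]
        for k in range(n):  # k = number of controls, 0 .. n-1
            for controls in combinations(others, k):
                gates.append((target, sum(1 << c for c in controls)))
    return gates


def gate_as_permutation(gate, n):
    """Truth table of a gate: flip the target bit when all controls are 1."""
    target, controls = gate
    return tuple(x ^ (1 << target) if (x & controls) == controls else x
                 for x in range(1 << n))


def optimal_gate_counts(n):
    """Map every reachable permutation to its minimal gate count (BFS)."""
    gate_perms = [gate_as_permutation(g, n) for g in toffoli_gates(n)]
    identity = tuple(range(1 << n))
    best = {identity: 0}
    frontier = deque([identity])
    while frontier:
        f = frontier.popleft()
        for g in gate_perms:
            extended = tuple(g[y] for y in f)  # circuit f followed by gate g
            if extended not in best:
                best[extended] = best[f] + 1
                frontier.append(extended)
    return best


histogram = Counter(optimal_gate_counts(3).values())
for gate_count in sorted(histogram):
    print(gate_count, histogram[gate_count])
```

The same enumeration is infeasible for n = 4 without the symmetry reductions and hashing described above, since all 16! permutations would have to be stored.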

Adapting algorithms from formal verification. In order to find optimal circuits
for reversible functions with more than four inputs, several sophisticated techniques
draw upon algorithms and data structures from the field of formal verification [Hachtel
and Somenzi 2000]. Two optimal synthesis approaches for generic reversible and
irreversible functions were developed in [Hung et al. 2006] and [Grosse et al. 2009] where
the former uses symbolic reachability analysis and the latter applies Boolean
satisfiability. In [Hung et al. 2006] a circuit is considered as a cascade of L stages each of
which is a 1-qubit or 2-qubit gate from the NCV library. Stage parameters (i.e., gate
type and gate qubits) are modeled such that outputs of the i-th stage are connected to
inputs of the (i + 1)-th stage. In this scenario, a minimal-length circuit is equivalent to
the smallest L. In contrast, for a given reversible function f, the algorithm of [Grosse
et al. 2009] checks whether a circuit implementing f with a sequence of d
multiple-control Toffoli gates exists. Starting with d = 1, d is incremented until a circuit is
found. While circuits are modeled in a similar fashion, the method in [Hung et al. 2006]
constructs an FSM (Finite State Machine) and employs a SAT solver to find a
counter-example. To achieve this, instead of working with L cascaded stages, 2^n parallel FSM
instances are generated for truth table rows. The outputs of all 2^n instances at time
t are inputs of modules at time t + 1. Grosse et al. [2009] used Boolean satisfiability
and several common SAT techniques as well as problem-specific information to
improve runtime. Optimal circuits with respect to interaction cost can be found similarly
[Saeedi et al. 2011b]. To improve runtime when handling large circuits, Wille et al.
[2008b] used a generalization of Boolean satisfiability, Quantified Boolean Formula
(QBF) satisfiability, and BDDs (Binary Decision Diagrams).

Footnote: Given a finite-state machine described by a sequential circuit and a set of states described
by a property, the reachability problem asks if the (un)desired states can be reached from the initial
state through a sequence of valid transitions.

4.2. Asymptotically Optimal Methods 

Aaronson and Gottesman [2004] demonstrated that any stabilizer circuit can be re- 
structured into 11 stages of Hadamard (H), Phase (P) and linear reversible circuits 
(C) in the order H-C-P-C-P-C-H-P-C-P-C. They also proved that the use of Hadamard 
and Phase provides at most a polynomial-time computational advantage since stabi- 
lizer circuits can be simulated by only NOT and CNOT gates. However, even when 
Hadamard and Phase gates are used, the size of a stabilizer circuit is likely to be
dominated by the size of CNOT blocks. We therefore turn our attention to asymptotically
optimal synthesis of linear functions.

When reversible functions are captured by (unitary) matrices, each row and each
column include a single '1' and '0's elsewhere. A different model proposed in [Patel et al.
2008] is specific to linear circuits and represents CNOT(i, j) by inserting a '1' into one
off-diagonal element of the identity matrix. This model allows one to cast synthesis of linear
circuits as the task of reducing a given matrix (of the function to be synthesized) to the
identity matrix by elementary row operations over GF(2). Each row operation
corresponds to a CNOT gate, and the sequence of row operations gives a reversible circuit.
This task is usually solved using Gaussian elimination, which requires O(n^2) row
operations and O(n^3) time. To this end, the input matrix is reduced to an upper triangular
matrix by a set of row operations, the resulting matrix is transposed, and this process
is repeated on the transposed matrix. To reduce the number of gates, Patel et al. [2008]
partition an n × n matrix into a set of sections, each of which contains m (e.g., m ≈ log_2 n)
columns. To construct an upper-triangular matrix, the algorithm eliminates repeated
rows in each section by applying carefully-planned row operations first. Then, diagonal
entries are fixed, and Gaussian elimination is used to remove all off-diagonal entries.
As in the standard approach, the same scenario is applied to the transposed matrix.
This technique reduces the worst-case number of operations (equivalently the size of
an n-wire CNOT circuit) to Θ(n^2 / log n), which is asymptotically optimal. Its runtime is
improved to O(n^3 / log n) versus O(n^3) for Gaussian elimination. Maslov [2007] studied
the depth (instead of size) of stabilizer circuits where only adjacent qubits can
interact. By presenting a constructive algorithm based on Gauss-Jordan elimination, he
demonstrated that any stabilizer circuit can be executed in at most 30n + O(1) stages
composed of only generic two-qubit gates. For the library of CNOT and single-qubit
gates, an (asymptotically-optimal) upper bound is 90n + O(1).
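
To make the matrix view of linear circuits concrete, the sketch below synthesizes a CNOT circuit with plain Gauss-Jordan elimination over GF(2). It is the O(n^2)-gate baseline, not the sectioned algorithm of Patel et al. [2008], and it assumes the convention that a CNOT with control c and target t adds row c to row t; the function names are hypothetical. Row operations that reduce the matrix to the identity are recorded and then reversed, because every CNOT is its own inverse.

```python
def synthesize_cnot_circuit(matrix):
    """Return CNOTs (control, target), in circuit order, realizing y = A x over GF(2).

    `matrix` is a list of n rows of 0/1 entries and must be invertible.
    """
    n = len(matrix)
    # Store each row as a bitmask: bit c of rows[r] is matrix[r][c].
    rows = [sum(bit << c for c, bit in enumerate(row)) for row in matrix]
    ops = []

    def add_row(src, dst):
        rows[dst] ^= rows[src]   # row addition over GF(2)
        ops.append((src, dst))   # corresponds to CNOT(control=src, target=dst)

    for col in range(n):
        if not (rows[col] >> col) & 1:
            # Fix the diagonal entry using a lower row with a 1 in this column.
            pivot = next((r for r in range(col + 1, n) if (rows[r] >> col) & 1), None)
            if pivot is None:
                raise ValueError("matrix is singular over GF(2)")
            add_row(pivot, col)
        for r in range(n):
            if r != col and (rows[r] >> col) & 1:
                add_row(col, r)  # clear every other entry of this column
    ops.reverse()                # undo the reduction to obtain the circuit
    return ops


def apply_cnot_circuit(gates, bits):
    """Run the CNOT circuit on a list of input bits."""
    bits = list(bits)
    for control, target in gates:
        bits[target] ^= bits[control]
    return bits


def apply_matrix(matrix, bits):
    """Reference result: matrix-vector product over GF(2)."""
    return [sum(a & b for a, b in zip(row, bits)) % 2 for row in matrix]


A = [[1, 1, 0],
     [0, 1, 1],
     [1, 1, 1]]
print(synthesize_cnot_circuit(A))
```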

4.3. Heuristic Methods 

Finding an optimal circuit for a given arbitrary-size reversible specification is in- 
tractable in general, hence heuristic methods have been developed to find reasonable 
solutions in practice. In this section, we review those methods that either significantly 
improved upon prior results or introduced new insights. 

Transformation-based methods. Miller et al. [2003] proposed a synthesis method
that compares the identity function (I) with a given permutation (F), as illustrated in
Fig. 10a, and applies reversible gates to transform F into I. To direct the transformation
(or select a gate), the complexity metric used is the sum of Hamming distances
between binary patterns of F and I at each truth table row. The algorithm iterates
through the rows of the truth table, looks for differences between F and I, and
corrects these differences by applying multiple-control Toffoli gates with positive controls
only. This algorithm was improved in [Maslov et al. 2007] where the authors direct
synthesis by the complexity of the Reed-Muller spectra instead of the Hamming
distance. The algorithm proposed in [Maslov et al. 2007] produces best-known circuits for
several families of benchmark functions with regular patterns in their permutations.

Footnote: An algorithm is asymptotically optimal if it performs at worst a constant factor worse than
the best possible algorithm. Formally, for a problem which needs Ω(f(n)) overhead according to a
lower-bound theorem, an algorithm which requires O(f(n)) overhead is asymptotically optimal. While
such a method may not find the optimal solution, no algorithm can outperform it by more than a fixed
constant factor. On the other hand, other algorithms may find smaller circuits in specific cases, run
faster or use less memory.

Fig. 10. Outlines of transformation-based algorithms (a), search-based methods (b), and search-based
methods directed by a complexity metric (c). In this figure, F is the input permutation, I is the
identity function, g_i is a possible gate, M is the maximum number of gates, and F_{ij} is a permutation
which results from applying g_j at the i-th level.
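
The row-by-row correction step can be sketched in a few lines. The code below is a minimal illustration of the basic unidirectional variant only, with positive controls and gates applied on the output side; it omits the bidirectional and Reed-Muller-directed refinements, and the function names and `(target, control_mask)` encoding are assumptions of this example. For each row i in increasing order, bits that must become 1 are set with gates controlled by the 1-bits of the current output, and bits that must become 0 are cleared with gates controlled by the 1-bits of i, so rows that are already correct are never disturbed.

```python
def apply_gate(x, target, controls):
    """Multiple-control Toffoli: flip `target` when all control bits are 1."""
    return x ^ (1 << target) if (x & controls) == controls else x


def transformation_based_synthesis(perm):
    """Return gates (target, control_mask), in circuit order, realizing perm."""
    n = len(perm).bit_length() - 1       # len(perm) must equal 2**n
    f = list(perm)
    gates = []

    def emit(target, controls):
        gates.append((target, controls))
        for row in range(len(f)):        # the gate acts on every output word
            f[row] = apply_gate(f[row], target, controls)

    for i in range(len(f)):
        if f[i] == i:
            continue
        out = f[i]                       # output of row i before correction
        for b in range(n):               # set bits that are 1 in i, 0 in out
            if (i >> b) & 1 and not (out >> b) & 1:
                emit(b, out)
        for b in range(n):               # clear bits that are 1 in out, 0 in i
            if (out >> b) & 1 and not (i >> b) & 1:
                emit(b, i)
    gates.reverse()                      # gates were found from the output side
    return gates


def simulate(gates, x):
    for target, controls in gates:
        x = apply_gate(x, target, controls)
    return x


spec = [1, 0, 3, 2, 5, 7, 4, 6]
circuit = transformation_based_synthesis(spec)
print(circuit)
```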

Multiple-control Toffoli gates with both positive and negative controls in a column-
wise (vs. row-wise as in [Miller et al. 2003]) scenario were used in [Saeedi et al. 
2007b]. This algorithm results in circuits composed of complex gates with common 
targets. Gates that share targets/controls can be further optimized by post-processing 
[Arabzadeh et al. 2010; Maslov and Saeedi 2011]. 

Search-based methods. As shown in Fig. 10b the search process can be represented
by a tree. One may search for an implementation of a function by starting
from an initial specification (root of the tree), applying individual gates (to generate
branches), and repeating this process on the resulting functions until the identity
specification is found in a branch. Given enough memory and time, this method can find
a minimal circuit. It is useful when gate-counts and the numbers of inputs/outputs
are small. To make this approach practical, one can select only those gates that
minimize a specific metric as illustrated in Fig. 10c. For example, in [Gupta et al. 2006]
common sub-expressions between the PPRM expansions of multiple outputs are
identified and used to simplify the outputs at each stage. Discovered factors are substituted
into the PPRM expansions to determine their potential for leading to a solution where
the primary objective is gate count (i.e., number of factors) minimization and the
secondary objective is gate size (i.e., number of literals in each factor) reduction. To share
factors among multiple outputs, candidate factors are selected among common
sub-expressions in PPRM expansions. However, there is no guarantee that the resulting
PPRM expressions contain fewer terms [Saeedi et al. 2007a]. To relax optimization
criteria, instead of evaluating previously substituted factors before new substitutions,
Saeedi et al. [2007a] considered all new factors first and proposed a hybrid method
that applies the second approach before the first. Donald and Jha [2008] improved
the method of [Gupta et al. 2006] to handle gates in the NCTSFP library in a similar
search-based framework. These algorithms can handle various gate types. Their
performance is affected by the number of input bits and the size of resulting circuits.

Fig. 11. A general outline of cycle-based methods (a), an example of the T|C|T|N method (b).
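
The undirected tree search of Fig. 10b can be illustrated with iterative deepening: the specification is the root, each gate produces a branch, and the depth bound grows until the identity is reached. The sketch below is a plain exhaustive version that works on truth tables rather than PPRM expansions and uses no complexity metric, so it is practical only for very small functions; the function names and gate encoding are assumptions of this example.

```python
from itertools import combinations


def toffoli_gates(n):
    """All multiple-control Toffoli gates on n lines as (target, control_mask)."""
    gates = []
    for target in range(n):
        others = [line for line in range(n) if line != target]
        for k in range(n):
            for controls in combinations(others, k):
                gates.append((target, sum(1 << c for c in controls)))
    return gates


def apply_to_outputs(perm, gate):
    """Place `gate` at the output side of the function `perm`."""
    target, controls = gate
    return tuple(y ^ (1 << target) if (y & controls) == controls else y
                 for y in perm)


def search_minimal_circuit(perm, max_gates):
    """Iterative-deepening search; returns gates in circuit order or None."""
    n = len(perm).bit_length() - 1
    gates = toffoli_gates(n)
    identity = tuple(range(len(perm)))

    def dfs(current, budget, path):
        if current == identity:
            return list(path)
        if budget == 0:
            return None
        for gate in gates:
            path.append(gate)
            found = dfs(apply_to_outputs(current, gate), budget - 1, path)
            path.pop()
            if found is not None:
                return found
        return None

    for bound in range(max_gates + 1):   # smallest bound first => minimal circuit
        found = dfs(tuple(perm), bound, [])
        if found is not None:
            return found[::-1]           # gates were applied from the output side
    return None


# Fredkin gate: line 2 controls a swap of lines 0 and 1.
fredkin = [0, 1, 2, 3, 4, 6, 5, 7]
print(search_minimal_circuit(fredkin, 4))
```

For the Fredkin specification the search stops at three gates, consistent with the construction of two CNOTs around a Toffoli gate mentioned earlier in the survey.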

Cycle-based methods. Instead of working with an entire permutation, one can 
factor it into a set of cycles and synthesize the resulting cycles separately as illustrated 
in Fig. 11a. This divide-and-conquer approach is particularly successful with reversible 
transformations that leave many inputs unchanged (sparse transformations). 
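
The factoring step that cycle-based methods start from is standard permutation arithmetic and can be sketched as follows. The helper names are assumptions of this example, and the code stops at the decomposition itself: it does not synthesize gates for the resulting cycles. A cycle (a0, a1, ..., ak) is read as a0 -> a1 -> ... -> ak -> a0, and it is rewritten as the transpositions (a0 a1), (a0 a2), ..., (a0 ak) applied in that order.

```python
def disjoint_cycles(perm):
    """Return the non-trivial cycles of perm (perm[x] is the image of x)."""
    seen = [False] * len(perm)
    cycles = []
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycle = []
        x = start
        while not seen[x]:
            seen[x] = True
            cycle.append(x)
            x = perm[x]
        if len(cycle) > 1:               # fixed points need no gates
            cycles.append(tuple(cycle))
    return cycles


def cycle_to_transpositions(cycle):
    """Transpositions that realize the cycle when applied left to right."""
    first = cycle[0]
    return [(first, other) for other in cycle[1:]]


def apply_transpositions(transpositions, size):
    """Compose transpositions, applied in order, into a permutation."""
    perm = []
    for x in range(size):
        for a, b in transpositions:
            if x == a:
                x = b
            elif x == b:
                x = a
        perm.append(x)
    return perm


spec = [1, 2, 0, 3, 4, 6, 5, 7]          # leaves inputs 3, 4 and 7 unchanged
cycles = disjoint_cycles(spec)
print(cycles)
print([cycle_to_transpositions(c) for c in cycles])
```

A sparse specification such as `spec` yields only a few short cycles, which is the situation in which the divide-and-conquer approach described above is most effective.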

Shende et al. [2003] proposed an NCT-based synthesis method which applies NOT 
(N), ToffoH (T), CNOT (C), and Toffoh (T) gates in order (i.e., the T|C|T|N method) to 
synthesize a permutation. As illustrated in Fig. lib, in the first C|T|N part, the terms 
and 2* of a given function are positioned at their right locations. The last Toffoli circuit 
fixes the other truth table terms by decomposing the resulting permutation into a set 
of transpositions. Subsequently, each pair of disjoint transpositions is implemented 
by a synthesis algorithm separately, and the final circuit is constructed by cascading 
individual circuits. A similar method was introduced in [Yang et al. 2006] except for 
working with neighboring 3-cycles, i.e., cycles whose elements differ only in two bits. 
This technique often produces an unnecessarily large number of cycles. An extension 
of the method from [Shende et al. 2003] described in [Prasad et al. 2006] reduces synthesis
cost by applying NOT and CNOT instead of Toffoli in many situations. 

Saeedi et al. [2010a] developed k-cycle synthesis, leading to significant reductions
in the quantum cost for large cycles, based on seven building blocks — a pair of 2- 
cycles, a single 3-cycle, a pair of 3-cycles, a single 5-cycle, a pair of 5-cycles, a single 
2-cycle (4-cycle) followed by a single 4-cycle (2-cycle), and a pair of 4-cycles — and 
a set of algorithms to synthesize a given cycle of length less than six [Saeedi et al. 
2010a]. Larger cycles are factorized into proposed building blocks. A hybrid synthesis 
framework was suggested which uses the cycle-based approach for irregular functions 
in conjunction with the method of [Maslov et al. 2007] for regular functions. The pro- 
posed cycle-based method leads to best-known circuits with respect to quantum cost 
for permutations which have no regular pattern. In addition, the maximum number 
of elementary gates for any permutation function in [Saeedi et al. 2010b] is less than 
$8.5n2^n + o(2^n)$, which is the sharpest upper bound for reversible functions so far (the
lower bound is $n2^n/\log n$ [Shende et al. 2003]). A more efficient decomposition algorithm
was proposed in [Saeedi et al. 2010b] which produces all minimal and inequivalent
factorizations each of which contains the maximum number of disjoint cycles.
These decompositions are used in a cycle-assignment algorithm based on the graph 
matching problem to select the best possible cycle pairs during synthesis. 

Fig. 12. A BDD-based circuit synthesis technique by example [Wille and Drechsler 2009] (a), templates of
sizes 2 and 4 [Maslov et al. 2007] (b).

BDD-based methods. Kerntopf [2004] introduced a synthesis algorithm that uses
binary decision diagrams (BDDs), and seeks to minimize the number of non-terminal
DD nodes. At each step, all possible gates are examined, and the corresponding decision
diagrams are constructed. The gates that minimize the complexity metric are selected
and further analyzed by repeating the same process. Wille and Drechsler [2009]
introduced a different algorithm that starts by constructing a BDD. Each BDD node 
is substituted by a cascade of reversible gates as shown in Fig. 12a. Node sharing due 
to reduction rules in ROBDDs can cause gate fanout which is prohibited in reversible 
logic. To overcome this obstacle, the algorithm adds constant bits to emulate fanout — 
in the worst case, for each BDD node, a new constant line may be added. While this
algorithm leads to a good reduction in both quantum cost and runtime, many constant 
and garbage bits are added which makes the results impractical for quantum comput- 
ers with a limited number of qubits. Wille et al. [2010b] reduced the number of lines 
by a post-processing technique where garbage lines are merged with appropriate con- 
stant lines. Although BDD-based techniques for reversible synthesis scale better than 
most other approaches, the large number of ancillae they generate makes the results 
difficult to use in practice, and the effort to consolidate ancillae can be substantial. 

5. POST-SYNTHESIS OPTIMIZATION 

To improve the results of synthesis algorithms, several optimization methods consider 
connected subsets of gates (sub-circuits) in a given circuit. Such sub-circuits are ana- 
lyzed one by one and replaced by equivalent (smaller) sub-circuits to improve cost. This 
sub-circuit replacement approach can leverage earlier-discussed techniques to improve 
large circuits using peephole optimization with linear runtime [Prasad et al. 2006]. 

5.1. Quantum Cost Improvement 

Equivalent sub-circuits can be found using either windowing or sub-circuit optimiza- 
tion and replacement [Prasad et al. 2006; Maslov et al. 2007; Maslov and Saeedi 2011]. 

Library-based optimization. Prasad et al. [2006] proposed an algorithm that uses 
a large database of optimal circuits and seeks sub-circuits that can be replaced by 
smaller equivalent sub-circuits. In practice, the stored sub-circuits are likely to be very 
limited in size. Prasad et al. [2006] introduced a compact data structure that can store 
all 3-bit reversible circuits and many 4-bit circuits with less than six gates. A window- 
ing strategy proposed in [Prasad et al. 2006] to identify contiguous sub-circuits can 
reorder some gates (without changing the overall functionality) to assemble larger 4- 
bit sub-circuits. The functionality of the sub-circuit found is computed, and a database 
look-up is performed to find an optimal circuit that implements the same functionality.
The sub-circuit is replaced if this improves cost. Originally, this algorithm was
applied to optimize reversible circuits composed of NOT, CNOT and Toffoli gates, but 
it can work with other gates as well. Such optimizations rely heavily on a database of 
optimal implementations and an efficient windowing strategy. 
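
To make the database idea concrete, the following Python sketch builds a table of gate-count-optimal 3-bit NCT circuits by breadth-first search and uses it to replace a redundant window. It is an illustration only: the compact data structure and windowing strategy of [Prasad et al. 2006] are not reproduced, the optimization criterion here is plain gate count, and all names (`build_library`, `simulate`, the example window) are hypothetical.

```python
from collections import deque

N_BITS = 3
SIZE = 1 << N_BITS


def gate_perm(controls, target):
    """Permutation of {0..7} realized by a multiple-control Toffoli gate."""
    mask = sum(1 << c for c in controls)
    return tuple(x ^ (1 << target) if x & mask == mask else x
                 for x in range(SIZE))


def nct_gates():
    """All NOT, CNOT and Toffoli gates on three lines (12 gates)."""
    gates = {}
    for t in range(N_BITS):
        others = [l for l in range(N_BITS) if l != t]
        gates[("NOT", t)] = gate_perm((), t)
        for c in others:
            gates[("CNOT", c, t)] = gate_perm((c,), t)
        gates[("TOF", others[0], others[1], t)] = gate_perm(others, t)
    return gates


def build_library():
    """Breadth-first search from the identity: the first circuit that
    reaches a function uses the fewest gates."""
    gates = nct_gates()
    identity = tuple(range(SIZE))
    library = {identity: ()}
    queue = deque([identity])
    while queue:
        f = queue.popleft()
        for name, g in gates.items():
            h = tuple(g[f[x]] for x in range(SIZE))  # apply f, then gate g
            if h not in library:
                library[h] = library[f] + (name,)
                queue.append(h)
    return library


def simulate(circuit):
    """Function computed by a list of gate names, as a permutation."""
    gates = nct_gates()
    f = tuple(range(SIZE))
    for name in circuit:
        f = tuple(gates[name][f[x]] for x in range(SIZE))
    return f


library = build_library()
# A redundant 4-gate window: the two identical CNOT gates cancel.
window = [("NOT", 0), ("CNOT", 0, 1), ("CNOT", 0, 1), ("CNOT", 1, 2)]
best = library[simulate(window)]
print(len(window), "->", len(best))  # 4 -> 2
```

For three bits the whole table (8! = 40320 functions) fits easily in memory; the same brute-force approach does not extend to wider windows, which is why the stored sub-circuits are limited in size.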

Each circuit stored in a library can be viewed as a rule that simplifies any other circuit
that computes the same function. For example, pairs of inverters, pairs of CNOTs
and pairs of Toffoli gates cancel out because the same function can be computed by an
empty circuit. To reduce the size of a library, such rules can be generalized by local
circuit transformations, leading to more compact rule sets. 

Transformation rules and template-based optimization. The work performed 
in [Iwama et al. 2002] introduced the idea of local transformation of reversible circuits. 
While the main purpose of this work was not post-synthesis optimization, its results 
were extended by other researchers to improve circuit cost. The authors defined a 
canonical form for circuits in the NCT library, and introduced a complete set of rules 
to transform any NCT-constructible circuit into its canonical form, which may or may 
not be compact. 

The concept of applying a rule set was extended in [Miller et al. 2003] where the 
authors introduced several transformation rules based on a set of predefined patterns 
called templates. A template T is a reversible circuit that implements the identity 
function, which contains $m$ gates $g_1, g_2, \ldots, g_m$. For a circuit with multiple-control
Toffoli and Fredkin gates, consider the first $k$ ($k > m/2$) gates of $T$ (i.e., $g_1, g_2, \ldots, g_k$).
Suppose that these gates are found in a reversible circuit in sequence. It can be verified
that, because these gates are self-inverse, the $m - k$ gates $g_m, \ldots, g_{k+2}, g_{k+1}$ can be
applied instead of the initial $g_1, g_2, \ldots, g_k$ gates to reduce the gate count from $k$ to $m - k$. The authors showed that
applying the template matching method (called template application algorithm) with 
two- and three-input templates only can improve the circuits. 
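
The replacement rule can be checked mechanically. The Python sketch below uses a hypothetical 5-gate identity built from CNOT gates (chosen for illustration, not taken from the cited template sets) and verifies that its first three gates compute the same function as the remaining two gates in reverse order.

```python
def apply_gate(state, gate):
    """Apply a multiple-control Toffoli gate (controls, target) to a bit tuple."""
    controls, target = gate
    if all(state[c] for c in controls):
        state = state[:target] + (state[target] ^ 1,) + state[target + 1:]
    return state


def truth_table(circuit, n):
    """Map every n-bit input to the circuit output."""
    table = {}
    for x in range(1 << n):
        state = tuple((x >> i) & 1 for i in range(n))
        for gate in circuit:
            state = apply_gate(state, gate)
        table[x] = state
    return table


def template_replacement(template, k):
    """For an identity circuit g1..gm of self-inverse gates, return the
    m-k gates gm, ..., g(k+1) that can replace the first k gates."""
    return list(reversed(template[k:]))


# Hypothetical identity circuit on lines a=0, b=1, c=2 (all CNOT gates):
# CNOT(a,b) CNOT(b,c) CNOT(a,b) CNOT(b,c) CNOT(a,c).
template = [((0,), 1), ((1,), 2), ((0,), 1), ((1,), 2), ((0,), 2)]
n = 3
identity = {x: tuple((x >> i) & 1 for i in range(n)) for x in range(1 << n)}
assert truth_table(template, n) == identity

k = 3  # k > m/2, so the replacement is shorter than the matched part
matched = template[:k]
replacement = template_replacement(template, k)
print(len(matched), "->", len(replacement))  # 3 -> 2
```

A full template-matching tool would also have to locate the matched gates in a larger circuit, possibly after moving commuting gates; that search is omitted here.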

In [Maslov et al. 2005b], template matching with up to six gates was used in post- 
synthesis optimization. The authors showed that there are 0, 1, 0, 1, 1, and 4 templates 
with 1, 2, 3, 4, 5, and 6 gates, respectively. Their analysis shows that these
seven templates comprise a complete set of templates of size at most 6, for < 4 inputs. Similarly,
the Toffoli-Fredkin templates were explored in [Maslov et al. 2005a] where the
authors showed that there are 0, 1, 0, 3, 1, and 1 Toffoli-Fredkin templates
with 1, 2, 3, 4, 5, and 6 gates, respectively. Toffoli templates were extended in [Maslov et al. 2007]
by the addition of all templates of size 7 (five templates) and a set of templates of size 
9 (four templates). Fig. 12b shows templates of sizes 2 and 4. In addition, the template 
application algorithm was enhanced leading to two templates of size 4 (vs. 1) and three 
templates of size 6 (vs. 4). Saeedi et al. [2011b] extended the templates to work with 
up to three SWAP gates. Template-based optimizations can be time-consuming, but 
scale to large circuits due to their local nature [Maslov et al. 2007]. One can restrict 
template application to small subsets of gates and lines to improve runtime. Such 
post-processing can be used in peephole optimization with guaranteed linear runtime 
[Prasad et al. 2006]. 

Arabzadeh et al. [2010] proposed a set of simplification rules in terms of positively 
and negatively controlled Toffoli gates. To optimize a sub-circuit which has gates with 
identical target as illustrated in Fig. 13a, a $C^{n-1}$NOT gate is represented by a Boolean
expression with $n - 1$ inputs and one output where gate controls act as the inputs and
the target behaves as the output. Next, this gate fills one cell of a Karnaugh map (K-map)
of size $n$ (i.e., $n - 1$ inputs, one output). To extract a simplified circuit, one can
use a K-map cell clustering similar to the one used in irreversible logic. The authors
showed that each cell with the value 1 can be used in an odd number of groups and
each cell with the value 0 can be used in an even number of groups. Some templates in
[Maslov et al. 2005b], e.g., the ones in Fig. 12, can be regenerated by applying a set of 
simplification rules from [Arabzadeh et al. 2010]. This simplification approach is suit- 
able for methods that generate subsequent gates on the same target line [Mishchenko 
and Perkowski 2002; Saeedi et al. 2007b]. An optimization in [Soeken et al. 2010b]
uses a window to select potential sub-circuits first. Then, re-synthesis, exact synthesis
and template matching methods are applied to improve the selected sub-circuits.

Footnote: If $g_1 \cdots g_k g_{k+1} \cdots g_m = I$, then $g_1 \cdots g_k = g_m^{-1} \cdots g_{k+1}^{-1}$. For self-inverse gates (Toffoli, Fredkin), $g^{-1} = g$.

Fig. 13. K-map-based optimization [Arabzadeh et al. 2010] (a), ancilla insertion [Miller et al. 2010] (b).
Fig. 14. Circuit equivalence used in [Maslov and Saeedi 2011].

Qubit insertion. Circuits can be simplified by adding ancillae. A well-known exam- 
ple is the implementation of the n-bit multiple-control Toffoli gate discussed in Section 
3. As another example, the generic algorithm in [Miller et al. 2010] searches for sub-circuits
with a set of shared controls C. Their gates are simplified by removing controls in C. 
Two identical multiple-control Toffoli gates are inserted before and after the simpli- 
fied sub-circuit as illustrated in Fig. 13b where their controls are the qubits in C and 
targets are on the zero-initialized line. This modification produces an equivalent but 
smaller circuit if the cost of added gates is smaller than that of removed controls in 
the multiple-control gates. This idea was further extended in [Miller et al. 2010] to add 
multiple ancillae. 

To compute a Boolean function by a quantum circuit, it is common to use only re- 
versible (non-quantum) gates. However, the use of quantum gates offers more freedom 
and may facilitate smaller circuits in some cases. Maslov and Saeedi [2011] proposed 
a circuit optimization that uses quantum Hadamard gates and therefore ventures beyond
the Boolean domain. For a reversible circuit $RC$ and $|00\ldots0\rangle$ ancillae, they consider
the transformation $|x\rangle|00\ldots0\rangle \mapsto RC|x\rangle|00\ldots0\rangle$ with at most $n$ ancillae for $n$ primary
inputs in the original reversible circuit. The ancillae are prepared by a layer of
Hadamard gates, as shown in Fig. 14. Sets of adjacent gates with shared controls are
identified. Since $H^{\otimes n}|00\ldots0\rangle$ is a 1-eigenvector of any 0-1 unitary matrix $RC$, applying
RC to this eigenvector does not modify the state. After that, the shared controls are 
removed from the gates involved. The values are transferred to the ancillae by apply- 
ing a set of Fredkin gates, and returned to the main qubits by reapplying the same 
set of Fredkin gates in the reverse order. This optimization is applied opportunisti- 
cally wherever it improves circuit cost. It is particularly suitable for reversible circuits 
with many complex gates which can be easily reordered, such as those produced in 
[Mishchenko and Perkowski 2002].

5.2. Reducing Circuit Depth

Parallel circuits speed up computation and can tolerate smaller coherence times. 
Maslov et al. [2008a] introduced a level compaction algorithm to reduce circuit level (or 
depth) of synthesized circuits by employing templates. To this end, a greedy algorithm 
was proposed which assigns an undefined level to all gates initially. Next, for each level 
i the leftmost gate with an undefined level is examined to verify whether this gate can 
be executed at level i or not. This process is continued until the algorithm finds no 
gate for execution at the i-th level. Next, a set of templates is applied, to change the 
control and target lines of different gates, and the level assignment process is repeated 
with the hope of improving circuit depth. Finally, i is incremented and other gates 
are examined similarly. While the proposed method is useful for level compaction, its 
efficiency can be improved by applying a more efficient gate selection method. 
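
The level-assignment step can be approximated by a simple as-soon-as-possible schedule, sketched below in Python. This is a simplification and not the algorithm of [Maslov et al. 2008a]: the template step that rewrites controls and targets is omitted, gates are never reordered, and each gate is modeled only by the set of lines it touches (an assumption made for illustration).

```python
def assign_levels(circuit):
    """Greedy as-soon-as-possible levelization.

    Each gate is given as the set of line indices it touches. A gate is
    placed one level after the latest earlier gate that shares a line."""
    last_level = {}   # line -> last level in which the line is used
    levels = []
    for lines in circuit:
        level = 1 + max((last_level.get(l, 0) for l in lines), default=0)
        for l in lines:
            last_level[l] = level
        levels.append(level)
    return levels


# Five gates on four lines; gates on disjoint lines share a level.
circuit = [{0, 1}, {2, 3}, {1, 2}, {0}, {3}]
levels = assign_levels(circuit)
print(levels, "depth =", max(levels))  # [1, 1, 2, 2, 2] depth = 2
```

Rewriting gates so that fewer of them share lines, as the templates do, is what lowers the depth beyond this baseline.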

5.3. Improving Locality 

Quantum-circuit technologies often require that each gate involve only geometrically 
adjacent qubits (in a particular physical layout). Given a fixed number of qubits, a 
quantum architecture can be described by a simple connected graph $G = (V, E)$, where
the vertices $V$ represent qubits and edges $E$ represent adjacent qubit pairs where gates
can be applied [Cheung et al. 2007]. A complete graph, $K_n$, expresses the absence of
constraints. The LNN (Linear Nearest Neighbor) architecture corresponds to a graph
with $n$ vertices $v_1, \ldots, v_n$ where an edge exists only between neighboring vertices
$v_i$ and $v_{i+1}$ for $1 \le i < n$. Several systems of trapped ions [Haffner et al. 2005], liquid
NMR (e.g., [Laforest et al. 2007]), and the original Kane model [Kane 1998] have
been designed based on the interactions between linear nearest neighbor qubits. The two-dimensional
square lattice (2DSL) corresponds to a graph on a two-dimensional Manhattan
grid where only four neighboring qubits can interact. The relevant proposals for
2DSL include arrays of trapped ions [Haffner et al. 2005], Kane's architecture [Skinner
et al. 2003], and Josephson junctions [Douçot et al. 2004]. The three-dimensional
square lattice (3DSL) model is a set of stacked 2D lattices where a qubit can interact
with six neighboring qubits. 3DSL is less restrictive, but suffers from the difficulty of
controlling 3D qubits. The architecture proposed in [Perez-Delgado et al. 2006] relies
on the 3DSL model. Cheung et al. [2007] introduced other architectures including the
Star architecture with one vertex of degree $n - 1$ connected to all other vertices, and the
Cycle ($C_n$), which is LNN with one extra interaction between the first and last qubit.
The $k$-th power of the graph $G$, denoted by $G^k$ (such as LNN$^k$), is the graph over the
same vertex set (of $G$) with edges representing paths of length at most $k$ in $G$.
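
These architecture graphs are easy to write down explicitly. The Python sketch below represents each architecture as a set of undirected edges over vertices 0..n-1 and assumes the usual graph-theoretic definition of the k-th power (vertices at distance at most k are joined); the function names are hypothetical.

```python
def lnn(n):
    """Linear nearest neighbor: a path on n vertices."""
    return {frozenset((i, i + 1)) for i in range(n - 1)}


def cycle(n):
    """LNN plus one edge between the first and last qubit."""
    return lnn(n) | {frozenset((0, n - 1))}


def star(n):
    """Vertex 0 has degree n-1 and is joined to every other vertex."""
    return {frozenset((0, i)) for i in range(1, n)}


def graph_power(edges, n, k):
    """Edges joining distinct vertices at distance at most k."""
    adj = {v: set() for v in range(n)}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    power = set()
    for s in range(n):
        frontier, seen = {s}, {s}
        for _ in range(k):
            frontier = {w for v in frontier for w in adj[v]} - seen
            seen |= frontier
        power |= {frozenset((s, t)) for t in seen if t != s}
    return power


print(len(lnn(5)), len(cycle(5)), len(star(5)))  # 4 5 4
print(len(graph_power(lnn(4), 4, 2)))            # 5
```

With these edge sets, checking whether a two-qubit gate is allowed reduces to a membership test on the pair of lines it touches.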

SWAP insertion. A naive method to satisfy (architectural) qubit-interaction con- 
straints is to use SWAP gates in front of an improper gate g to 'move' the control 
(target) line of g towards the target (control) line as much as required. Subsequently, 
SWAP gates should be added to restore the original ordering of circuit lines. This pro- 
cess can be repeated for all gates. More efficient circuits were found in application- 
specific studies that explored the physical implementations of the quantum Fourier 
transform [Takahashi et al. 2007; Maslov 2007], Shor's factorization algorithm [Fowler 
et al. 2004a; Kutin 2007], quantum addition [Choi and Van Meter 2008], and quantum 
error correction [Fowler et al. 2004b] for the LNN architectures. Researchers considered
the impact of LNN constraints on the synthesis of general quantum/reversible circuits
in [Shende et al. 2006] where their number of gates was increased by almost an
order of magnitude, and in [Mottonen and Vartiainen 2006] and [Saeedi et al. 2011a]
where their numbers of CNOT gates were increased by a factor of less than 2.

Footnote: The study of parallel quantum algorithms has attracted attention in complexity theory too. $\mathrm{NC}^k$ is the
class of decision problems solvable by a uniform family of Boolean circuits, with polynomial size, depth
$O(\log^k n)$, and fan-in 2. $\mathrm{QNC}^0$ is the class of constant-depth quantum circuits without fanout gates. The
question whether $P \subseteq \mathrm{NC}$ or $P \subseteq \mathrm{QNC}$ is open.

Cheung et al. [2007] discussed the translation overhead for converting an arbitrary circuit
from one architecture to another one. Particularly, they showed that translating a circuit
from $K_n$ to Star, LNN, 2DSL, and 3DSL requires $O(n)$, $O(n)$, $O(\sqrt{n})$, and $O(\sqrt[3]{n})$
overhead, respectively. Converting Star, 2DSL, 3DSL, LNN$^k$, $C_n$, and [...] to LNN
requires $O(1)$, $O(\sqrt{n})$, $O(n^{2/3})$, $O(k)$, $O(1)$, and $O(1)$ overhead, respectively. Most importantly,
Star is the weakest architecture among those considered, e.g., the overhead
of converting a circuit from LNN to Star is $O(n)$.
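
The naive SWAP-insertion scheme described at the start of this paragraph can be sketched as follows in Python for two-qubit gates on an LNN chain. The sketch assumes every gate is a CNOT given as a (control, target) pair of line indices and always moves the control toward the target; the operation format and function names are hypothetical.

```python
def naive_swap_insertion(circuit):
    """Make every (control, target) gate local on an LNN chain.

    SWAPs move the control next to the target, the gate is applied, and
    the same SWAPs in reverse restore the original line order."""
    out = []
    for c, t in circuit:
        step = 1 if t > c else -1
        swaps = [("SWAP", p, p + step) for p in range(c, t - step, step)]
        out.extend(swaps)
        out.append(("GATE", t - step, t))
        out.extend(reversed(swaps))
    return out


def run_mapped(ops, bits):
    """Simulate the mapped circuit, treating every GATE as a CNOT."""
    bits = list(bits)
    for kind, p, q in ops:
        if kind == "SWAP":
            bits[p], bits[q] = bits[q], bits[p]
        else:
            bits[q] ^= bits[p]
    return bits


circuit = [(0, 3), (3, 1)]
ops = naive_swap_insertion(circuit)
n_swaps = sum(1 for op in ops if op[0] == "SWAP")
print(n_swaps)  # 6: four SWAPs for gate (0, 3), two for gate (3, 1)
```

Each non-local gate costs 2(d - 1) SWAPs, where d is the distance between its two lines, which is the overhead that the reordering techniques discussed next try to avoid.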

SWAP optimization. To adapt circuits to restricted architectures, synthesis algo- 
rithms can minimize the number of elementary gates or the SWAP gates. In this con- 
text, exact and heuristic synthesis algorithms as well as post-synthesis optimization 
methods can be applied. Unlike optimal methods, heuristic post-synthesis optimiza- 
tions scale well to large functions. Template matching for SWAP reduction and re- 
ordering strategies, global and local reordering, were introduced as powerful tools for 
SWAP reduction in [Saeedi et al. 2011b]. In global reordering, lines with the highest 
interaction impact are sequentially chosen for reordering and placed at the middle 
line. This procedure is repeated until the cost cannot be reduced. In contrast, local re- 
ordering traverses a given circuit from inputs to outputs and adds SWAP gates only in 
front of each non-local gate, but not after. Instead, the resulting ordering is used for the 
rest of the circuit. This process is repeated until all gates are traversed, as illustrated 
in Fig. 15. Similar reordering scenarios were applied by hand to reduce the number of 
SWAPs in specific circuits, e.g., in [Takahashi et al. 2007] for QFT.
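
A minimal Python sketch of the local reordering idea is given below, under simplifying assumptions: all gates are two-qubit CNOTs on an LNN chain, the control is always the qubit that moves, and no template matching or global reordering is applied. It illustrates the principle rather than the procedure of [Saeedi et al. 2011b]; all names are hypothetical.

```python
def local_reordering(circuit, n):
    """Insert SWAPs only in front of each non-local gate and keep the
    resulting line order for the rest of the circuit.

    Returns the operations on physical lines and pos, where pos[q] is the
    final physical line of logical qubit q."""
    pos = list(range(n))   # pos[q] = physical line of logical qubit q
    at = list(range(n))    # at[p]  = logical qubit on physical line p
    out = []
    for c, t in circuit:
        step = 1 if pos[t] > pos[c] else -1
        while abs(pos[t] - pos[c]) > 1:
            p = pos[c]
            neighbor = at[p + step]
            out.append(("SWAP", p, p + step))
            at[p], at[p + step] = neighbor, c
            pos[c], pos[neighbor] = p + step, p
        out.append(("GATE", pos[c], pos[t]))
    return out, pos


def run_mapped(ops, bits):
    """Simulate the mapped circuit, treating every GATE as a CNOT."""
    bits = list(bits)
    for kind, p, q in ops:
        if kind == "SWAP":
            bits[p], bits[q] = bits[q], bits[p]
        else:
            bits[q] ^= bits[p]
    return bits


circuit = [(0, 3), (0, 3), (1, 2)]
ops, pos = local_reordering(circuit, 4)
n_swaps = sum(1 for op in ops if op[0] == "SWAP")
print(n_swaps, pos)  # 2 [2, 0, 1, 3]
```

On this example, restoring the line order after every gate would need eight SWAPs, whereas keeping the new order needs two; the price is that the outputs appear on permuted lines, as recorded in `pos`.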

Ensuring the minimal possible number of SWAP gates. For a qubit set 
$\{x_1, x_2, \ldots, x_n\}$, assume that the 1st, 2nd, ..., and $n$-th qubits should be placed at $C(x_1)$,
$C(x_2)$, ..., and $C(x_n)$ positions ($C$ is the transformation function) to make a gate local.
Hirata et al. [2011] showed the number of SWAP gates necessary for this purpose is at
least the number of pairs in the set $S = \{(x_i, x_j) \mid C(x_j) < C(x_i),\ i < j\}$ and a bubble sort
generates this minimum number of SWAPs for each gate. To find the minimal number
of SWAP gates for a given circuit, all possible qubit orderings can be exhaustively
searched. However, the efficiency of this approach is limited by the large search space. 
On the other hand, for two qubits positioned at the locations $i_1$ and $i_2$ ($i_2 > i_1$), only
those qubits that are placed between them need to be considered (i.e., $(i_2 - i_1)!$ permutations
instead of all $n!$ permutations) [Hirata et al. 2011]. To further improve runtime,
Hirata et al. [2011] considered only $(i_2 - i_1)$ permutations for each gate and analyzed
only w consecutive gates instead of considering all possible gates as performed by an 
exhaustive method. Applying the techniques of [Hirata et al. 2011] with w = 10 on the 
8-qubit AQFT$_5$ circuit improved hand-optimized results of [Takahashi et al. 2007].
The cost of the AQFT circuit was further optimized by templates introduced in [Saeedi 
et al. 2011b]. Minimizing AQFT circuits is an open challenge. 
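
The lower bound on SWAPs for a single reordering is the number of inversions of the target placement, and bubble sort attains it. The Python sketch below checks this on a small example; `targets[i]` plays the role of $C(x_i)$, and the function names are hypothetical.

```python
def inversion_count(targets):
    """Size of S = {(x_i, x_j) : C(x_j) < C(x_i), i < j},
    where targets[i] = C(x_i)."""
    n = len(targets)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if targets[j] < targets[i])


def bubble_sort_swaps(targets):
    """Adjacent SWAPs performed by bubble sort to reach the target order."""
    order = list(targets)
    swaps = []
    for end in range(len(order) - 1, 0, -1):
        for p in range(end):
            if order[p] > order[p + 1]:
                order[p], order[p + 1] = order[p + 1], order[p]
                swaps.append((p, p + 1))
    return swaps


targets = [2, 0, 3, 1]
print(inversion_count(targets), len(bubble_sort_swaps(targets)))  # 3 3
```

Every adjacent SWAP removes at most one inversion, which is why the inversion count is a lower bound, and bubble sort only swaps inverted neighbors, which is why it meets the bound.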

Key synthesis and optimization algorithms are compared in Table II. 

Footnote: The Quantum Fourier Transform plays a key role in many quantum algorithms. As the number of input
qubits grows, QFT needs exponentially smaller phase shifts, which complicates its physical implementation.
Therefore, the Approximate Quantum Fourier Transform (AQFT$_m$) was defined by circuits created from
QFT except that all phase shift gates $R_p$ with phase $2\pi/2^p$ are ignored for $p > m$ ($m$ is the approximation
parameter). While a QFT of size $n$ requires $O(n^2)$ gates to implement, Cheung [2004] showed that AQFT$_m$,
$m = \log_2 n$, with $O(n \log_2 n)$ gates achieves almost the same accuracy level.

Fig. 15. Global and local reordering [Saeedi et al. 2011b].

Table II. Synthesis and optimization algorithms for reversible circuits.

| SYNTHESIS METHOD | FEATURES | LIMITATIONS | LIBRARY | METRIC |
|---|---|---|---|---|
| [Shende et al. 2003] | Heuristic synthesis; Fast; No garbage | Circuit dependency | NCT | QC |
| [Prasad et al. 2006] | Optimization; Fast; No garbage | Library dependency; Local optimum only; Circuit dependency; Windowing strategy | NCT | QC |
| [Gupta et al. 2006] | Heuristic synthesis; No garbage | Limited scalability | NCT | QC |
| [Hung et al. 2006] | Optimal synthesis | Limited scalability | NCV | QC |
| [Maslov et al. 2007] | Heuristic synthesis; Optimization | Large runtime; Function dependency; Local optimum only; Windowing strategy | NCT | QC |
| [Maslov et al. 2008a] | Optimization; Fast | Windowing strategy; Local optimum only | NCT | Depth |
| [Donald and Jha 2008] | Heuristic synthesis; No garbage | Limited scalability | NCTSFP | QC |
| [Grosse et al. 2009] | Optimal synthesis | Limited scalability | NCT | QC |
| [Wille and Drechsler 2009] | Heuristic synthesis; Fast, scalable; Compact circuits | Numerous ancillae | NCT | QC |
| [Wille et al. 2010b] | Optimization; Fast | Local optimum only; Circuit dependency; Windowing strategy | NCT | Ancilla |
| [Arabzadeh et al. 2010] | Optimization; Fast | Local optimum only; Circuit dependency; Windowing strategy | NCT | QC |
| [Saeedi et al. 2010a] | Heuristic synthesis; Fast; No garbage | Function dependency | NCT | QC |
| [Saeedi et al. 2011b] | Optimization; Fast | Local optimum only; Circuit dependency | Any | Locality |
| [Hirata et al. 2011] | Optimization | Local optimum only; Circuit dependency | Any | Locality |
| [Maslov and Saeedi 2011] | Optimization; Fast | Local optimum only; Circuit dependency | Any | QC |

6. BENCHMARKS AND SOFTWARE TOOLS

To analyze the effectiveness of reversible-logic synthesis algorithms, a variety of benchmark
functions are available. Maslov [2011] developed and has been maintaining the
Reversible Logic Synthesis Benchmarks Page which offers the widely-used benchmark
functions for reversible logic and their best-known circuits (as communicated to the 
maintainer). RevLib introduced in [Wille et al. 2008a] is not limited to best-known
circuits and, in addition to some results from [Maslov 2011], includes a variety of sub- 
optimal circuits. The open-source toolkit RevKit [Soeken et al. 2010a] includes several 
utilities and implements algorithms for reversible circuit synthesis. In [Wille et al. 
2010a], a programming language was proposed to specify a reversible transformation 
from which a compiler can generate reversible circuits. A circuit browser, RCViewer, 
was developed by Scott and Maslov in 2003, and later described in [Maslov 2011]. An 
improved version, RCViewer+, was reported in [Arabzadeh and Saeedi 2011]. RevKit 
and RCViewer+ support a number of features: circuit visualization and cost analysis,
equivalence checking, and circuit diagram plotting using LaTeX q-circuit format
(http://www.cquic.org/Qcircuit/).
The most common reversible benchmark families are as follows. 



• Reversible functions with known optimal circuits include all 3-input [Shende 
et al. 2003] and 4-input reversible functions [Golubitsky et al. 2010]. 

o 4-bit functions with maximal gate count: This set introduced by Golubitsky and 
Maslov [Maslov 2011] contains all 4-bit functions whose optimal implementations 
use 15 gates (largest number possible). 

o Gray code transforms: The N-bit transform GraycodeN converts binary-coded integers
to Gray-coded integers. As this function is GF(2)-linear, an optimal circuit
requires only CNOT gates: CNOT(b,a) CNOT(c,b) ... CNOT(z,y) for qubits a, b, ...,
z. Several heuristics [Gupta et al. 2006; Maslov et al. 2007; Saeedi et al. 2010a]
produce optimal circuits for this family of functions. 

• Arithmetic functions have applications in quantum algorithms [Childs and van 
Dam 2010]. In conventional circuit design, 32-bit and 64-bit arithmetic circuits are 
of significant interest because they are used in word-level CPUs. Sophisticated op- 
timizations have been developed for such special cases [Dimitrov et al. 2011]. How- 
ever, no such standard sizes have been established for reversible circuits and appli- 
cations in quantum computing suggest that such standardization is highly unlikely. 
Therefore, the design of arithmetic circuits focuses on scalable benchmarks and syn- 
thesis algorithms rather than a handful of super-optimized circuits. Another distinc- 
tion from conventional logic circuits is that (as of 2011) we are unable to motivate 
studies in reversible implementations of floating-point arithmetic. 

o Adders: The function nbitadder introduced in [Feynman 1986] has two n-bit inputs
A and B and one (n + 1)-bit output A + B. Quantum circuits for elementary
arithmetic operations are important for the implementation of Shor's factorization
algorithm. With one ancilla, a quantum circuit with depth $2n + 4$ and size
$9n - 8$ was proposed in [Cuccaro et al. 2005]. A variant method yields a circuit
with size $6n + 1$ and depth $6n + 1$. Takahashi et al. [2010] proposed a quantum circuit
with depth $5n - 3$ and size $7n - 6$ for nbitadder with no ancilla and an $O(d(n))$-depth
$O(n)$-size quantum circuit with $O(n/d(n))$ ancillae where $d(n) = \Omega(\log n)$.

o Modulo adders: The function modNadder has $2\lceil\log_2 N\rceil$ inputs/outputs where for
each codeword, the input is a pair of modulo-N numbers A and B, while the
output is the pair of modulo-N numbers (A, A + B mod N). As of 2011, the best
results for modNadder functions were obtained by [Maslov et al. 2007].

o Galois field multipliers: The Galois field multiplication function gfp^mmult [Cheung
et al. 2009] has $2m\lceil\log_2 p\rceil$ inputs and $m\lceil\log_2 p\rceil$ outputs. It computes the
field product of two GF($p^m$) elements, a and b. GF($p^m$) is used in a quantum
polynomial-time algorithm that computes discrete logarithm over an elliptic
curve group, and it has applications in quantum cryptography. An $O(m)$-depth
multiplication circuit for GF($2^m$) targeted for an LNN architecture was proposed
in [Cheung et al. 2009].

o Divisibility checkers: The function NmodK has N inputs and a single output. Its 
output is 1 for those codewords that are divisible by the integer K. As of 2011, 
the best results for NmodK functions were obtained by [Maslov et al. 2005b].

• Hard benchmarks are mainly proposed to stress-test the existing synthesis algorithms.
These functions may be produced from hard benchmarks developed for
conventional logic synthesis. 

o Hidden weighted-bit function: The function hwbN has N inputs/outputs where the
input codeword is cyclically shifted by the number of ones it has. The conventional
HWB function returns the value of the input bit, indexed by the number of ones
(mod n), and all of its ROBDDs have exponential size [Bollig et al. 1999]. Markov
and Maslov showed that hwbN functions can be implemented with a polynomial
cost $O(n \log^2 n)$ if a logarithmic number of garbage bits $\lceil \log n \rceil + 1$ is available
[Maslov 2011]. Efficient synthesis with no garbage bits remains open. Known
circuits for hwb functions with no ancilla exhibit an exponential number of gates.
As of 2011, the best results for medium-size hwbN functions with no ancilla were 
obtained by applying the method of [Saeedi et al. 2010a]. 

o Reversible variants of high-complexity functions.

o Computation of N-th prime: The function nth_primeK_inc introduced as a re- 
versible benchmark in [Maslov 2011] has K inputs/outputs. For an input value 
n, this function returns the n-th prime, as long as this prime may be written 
using at most K bits. The algorithm in [Lagarias and Odlyzko 1987] runs in 
exponential time, and no poly-time circuits or algorithms are known as of 2011. 
The smallest circuits to date are shown in [Saeedi et al. 2010a]. For a simpler 
problem — primality testing — polynomial circuits are proven to exist, but no 
practical constructions are known as of 2011 [Markov and Roy 2003]. 

o Computation of the matrix permanent: The function permanentNxN introduced
by Maslov as a reversible benchmark has $N^2$ inputs and $\lceil \log(N!) \rceil$ outputs.
It computes the permanent of a 0-1 matrix. There is strong evidence
that no polynomial-time non-quantum algorithm exists for this computation 
[Jerrum et al. 2004]. 
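
Returning to the adder benchmarks listed above, a ripple-carry adder with one ancilla can be sketched in Python as a cascade of CNOT and Toffoli gates. The construction below follows the majority / un-majority-and-add block structure commonly associated with [Cuccaro et al. 2005], but it is written from that general idea rather than transcribed from the paper, so the line layout, gate order, and names are assumptions; it is simulated here on classical bits only.

```python
def ripple_carry_adder(n):
    """Gate list for an n-bit in-place adder with one ancilla.

    Lines: 0 = ancilla (initially 0), 1+2i = b_i, 2+2i = a_i,
    2n+1 = carry-out z. Gates are ("C", control, target) or
    ("T", control1, control2, target)."""
    anc, z = 0, 2 * n + 1
    a = [2 + 2 * i for i in range(n)]
    b = [1 + 2 * i for i in range(n)]
    gates = []
    for i in range(n):                    # majority blocks: compute carries
        c = anc if i == 0 else a[i - 1]
        gates += [("C", a[i], b[i]), ("C", a[i], c), ("T", c, b[i], a[i])]
    gates.append(("C", a[n - 1], z))      # copy the final carry
    for i in reversed(range(n)):          # undo carries and form sum bits
        c = anc if i == 0 else a[i - 1]
        gates += [("T", c, b[i], a[i]), ("C", a[i], c), ("C", c, b[i])]
    return gates


def run(gates, state):
    """Apply the gate list to a list of classical bits, in place."""
    for g in gates:
        if g[0] == "C":
            state[g[2]] ^= state[g[1]]
        else:
            state[g[3]] ^= state[g[1]] & state[g[2]]
    return state


def add(n, x, y):
    """Load a = x and b = y, run the adder, and read b and z as the sum."""
    state = [0] * (2 * n + 2)
    for i in range(n):
        state[2 + 2 * i] = (x >> i) & 1
        state[1 + 2 * i] = (y >> i) & 1
    run(ripple_carry_adder(n), state)
    total = sum(state[1 + 2 * i] << i for i in range(n))
    total += state[2 * n + 1] << n
    return total, state


total, state = add(3, 5, 6)
print(total, len(ripple_carry_adder(3)))  # 11 19
```

The sketch uses 6n + 1 gates and leaves both the ancilla and the input A unchanged; it makes no attempt to reproduce the depth optimizations of the cited circuits.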

Other reversible functions considered as benchmarks [Maslov 2011; Wille et al. 
2008a; Gupta et al. 2006; Grassl 2003] include Hamming coding functions and quan- 
tum error-correcting codes. 
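
As a small worked example for the Gray code family mentioned earlier, the CNOT cascade can be simulated directly. The Python sketch below assumes line 0 holds the least significant bit (qubit a) and checks the cascade against the standard formula x XOR (x >> 1); the function names are hypothetical.

```python
def graycode_circuit(n):
    """CNOT cascade as (control, target) pairs: CNOT(b,a), CNOT(c,b), ...
    Line 0 is the least significant bit."""
    return [(i + 1, i) for i in range(n - 1)]


def run_cnots(circuit, x, n):
    """Apply the CNOT gates to the n-bit integer x."""
    bits = [(x >> i) & 1 for i in range(n)]
    for control, target in circuit:
        bits[target] ^= bits[control]
    return sum(bit << i for i, bit in enumerate(bits))


n = 4
outputs = [run_cnots(graycode_circuit(n), x, n) for x in range(1 << n)]
print(outputs[:8])  # [0, 1, 3, 2, 6, 7, 5, 4]
```

The order of the gates matters: each control is read before it is itself overwritten by the next gate in the cascade.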

7. CONCLUSION AND FUTURE DIRECTIONS 

Reversible logic circuits have been studied for at least 30 years [Toffoli 1980], with 
several different motivations in mind — from low-power computing and bit-twiddles in 
computer graphics algorithms, to photonic circuits and quantum information process- 
ing. Synthesis of reversible logic circuits is typically partitioned into (i) a pre-synthesis 
optimization that revises the specification, (ii) synthesis per se, (iii) post-synthesis local
optimization, and (iv) technology mapping that reflects specific limitations of a
given implementation technology. 

Despite significant progress in reversible logic synthesis, a number of open chal- 
lenges remain — some are in the domain of reversible circuits and others in the 
broader domain of quantum information processing. In particular, existing reversible 
synthesis techniques do not perform well on important benchmarks such as arithmetic 
functions — they produce circuits that are much larger compared to known solutions. 

• Traditional reversible logic synthesis.

o Scalable synthesis of general reversible functions targeting different cost models
and gate libraries without significant overhead.
o Technology mapping (see Section 3) for specific applications, gate libraries, and
cost models.
o Optimal (or efficient) synthesis of reversible functions useful in specific applications,
e.g., quantum algorithms such as Shor's number-factoring.
o Efficient synthesis algorithms for T-constructible permutations.

Footnote: The permanent of an $n \times n$ matrix $A = (a_{i,j})$ is defined as $\mathrm{perm}(A) = \sum_{\sigma \in S_n} \prod_{i=1}^{n} a_{i,\sigma(i)}$ where $S_n$ is the
symmetric group and $\sigma$ is a permutation in $S_n$. For example, for $N = 2$ the permanent is $a_{1,1}a_{2,2} + a_{1,2}a_{2,1}$.

• Lower and upper bounds, and worst-case reversible functions. 

o Sharper lower and upper bounds on the number of elementary gates for reversible
functions. Current lower bound is $n2^n/\log n$ [Shende et al. 2003] and
upper bound is $8.5n2^n + o(2^n)$ [Saeedi et al. 2010a].

o Lower and upper bounds on the number of elementary gates for T-constructible 
reversible functions. 

o Super-linear lower bounds on the size and depth of NC-circuits (and stabilizer 
circuits) for specific functions [Aaronson and Gottesman 2004]. 

o The minimal number of CNOT gates required to implement an n-qubit Toffoli gate
(or other useful gates such as the Fredkin gate) with and without ancillae. Without
ancillae, the number of CNOT gates is $O(n^2)$, and $O(n)$ gates are sufficient
when at least one ancilla is available [Barenco et al. 1995].

• Optimization of quantum circuits. 

o Synthesis of circuits with provably minimum size (and depth) for stabilizer (or
GF(2)-linear) operators [Aaronson and Gottesman 2004].

o Small quantum circuits for permutation functions [Maslov and Saeedi 2011].

In addition to these rather specific challenges, entirely new concepts and techniques 
may be discovered for representing reversible functions and synthesizing reversible 
circuits. Further considerations for future research are summarized below. 

Keeping applications in mind. Given that reversible logic circuits today are 
largely motivated by quantum, nano and photonic computing, we note that these novel 
computing paradigms promise improvements in rather narrow circumstances, while 
suffering from serious general drawbacks. For example, quantum algorithms are likely 
to be handicapped by size limitations, as well as quantum noise and decoherence, but 
offer polynomial-time algorithms for certain problems where conventional computers 
currently spend exponential time [Nielsen and Chuang 2000]. Therefore, it does not 
necessarily make sense to study reversible versions of every conventional circuit. Aside 
from trying to stress-test logic synthesis tools, specific reversible circuits must be mo- 
tivated by applications. For example, reversible adders and modular multipliers have 
been motivated by Shor's quantum number-factoring algorithm [Beckman et al. 1996; 
Van Meter and Itoh 2005] which leverages unique properties of quantum circuits. 

Sequential reversible computation was studied as early as 1980 [Toffoli 1980],
but research on this topic still lacks sufficient motivation. In conventional circuits,
sequential elements are clocked, but reversible clocking has not been considered (and
may not make sense, since clocking is not a logic operation). This undermines
considerations of low power for sequential reversible computation, as clocking and
clocked elements consume a large fraction of the energy used by CMOS circuits. Most
uses of reversible transformations in cryptography, DSP and computer graphics are
combinational in nature. Most quantum computers use stationary qubits and apply
gates to these qubits, unlike CMOS circuits where gates are stationary and signals
traverse the circuits. In the context of stationary qubits, quantum (and reversible)
circuits already have significant sequential semantics [Morita 2008], and there is no
need for dedicated sequential elements. Algorithms for the design, analysis and
verification of Quantum Finite Automata [Moore and Crutchfield 2000] may be of some
interest, if sufficiently motivated by applications.

Verification of reversible circuits is important because circuit optimization al- 
gorithms and software tools have become so complex that subtle bugs are very likely. 
Fortunately, equivalence-checking techniques for conventional combinational circuits 
can be applied by converting CNOT gates to XORs, Toffoli gates into ANDs and XORs, 
and so on (Fig. 1a). Additional efficiency improvements can be obtained by exploiting
reversibility as illustrated in [Wille et al. 2009; Yamashita and Markov 2010]. Verifi- 
cation of non-Boolean quantum circuits is more challenging and, in general, appears 
as hard as quantum simulation [Viamontes et al. 2007; Yamashita and Markov 2010]. 
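
The conversion described above can be illustrated with a minimal sketch, which is not
taken from the cited works. It assumes a hypothetical gate encoding (controls, target),
evaluates each gate as an AND of its controls XORed into the target, and compares two
circuits exhaustively, which is feasible only for a small number of lines. The second
check uses one simple consequence of reversibility: NOT, CNOT and Toffoli gates are
self-inverse, so one circuit followed by the reversed other must be the identity.

```python
from itertools import product

# Hypothetical encoding: a gate is (controls, target).
# NOT has no controls, CNOT has one, Toffoli has two.
def apply_gate(bits, gate):
    controls, target = gate
    out = list(bits)
    if all(bits[c] for c in controls):  # AND of the control lines
        out[target] ^= 1                # XOR into the target line
    return tuple(out)

def simulate(circuit, bits):
    for gate in circuit:
        bits = apply_gate(bits, gate)
    return tuple(bits)

def equivalent(circ_a, circ_b, n):
    """Exhaustive comparison over all 2**n inputs (small n only)."""
    return all(simulate(circ_a, x) == simulate(circ_b, x)
               for x in product((0, 1), repeat=n))

def equivalent_via_inverse(circ_a, circ_b, n):
    """circ_a followed by the inverse of circ_b must act as the identity."""
    # Each gate is self-inverse, so reversing the gate order inverts circ_b.
    composed = list(circ_a) + list(reversed(circ_b))
    return all(simulate(composed, x) == x
               for x in product((0, 1), repeat=n))

# Illustrative circuits: two CNOT-based swaps of lines 0 and 1, and a non-swap.
swap_a = [((0,), 1), ((1,), 0), ((0,), 1)]
swap_b = [((1,), 0), ((0,), 1), ((1,), 0)]
not_swap = [((0,), 1), ((1,), 0)]

print(equivalent(swap_a, swap_b, 2))              # True
print(equivalent(swap_a, not_swap, 2))            # False
print(equivalent_via_inverse(swap_a, swap_b, 2))  # True
```

Practical equivalence checkers replace the exhaustive loop with SAT- or BDD-based
reasoning; the sketch only shows the Boolean semantics being checked.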

Circuit test is vital to check if a circuit works as expected. Efficient test is critical for 
mass-production facilities, whereas laboratory experiments emphasize precision. Test 
techniques are sensitive to dominant fault types, but fortunately CMOS circuits can 
be tested reasonably well assuming only stuck-at fault models [Bushnell and Agrawal 
2000]. Given that CMOS is not the dominant technology for reversible circuits, the 
use of stuck-at fault models in this context may be unjustified, or at least requires 
explicit justification. An example reversible fault model is given in [Polian et al. 2005], 
where circuit test is performed for missing gates. In the case of quantum circuits, test 
is particularly complicated because measurements produce nondeterministic results. 
Therefore, quantum-computing experiments are typically verified using tomography 
[Altepeter et al. 2005], i.e., plotting an entire distribution of possible outcomes. 
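
As a rough illustration of testing for missing gates, the following sketch searches
exhaustively for one input vector per gate that distinguishes the fault-free circuit
from the circuit with that gate removed. The gate encoding (controls, target) and the
function names are assumptions made for this example; it is not the test-generation
method of [Polian et al. 2005].

```python
from itertools import product

# Hypothetical encoding: a gate is (controls, target).
def apply_gate(bits, gate):
    controls, target = gate
    out = list(bits)
    if all(bits[c] for c in controls):
        out[target] ^= 1
    return tuple(out)

def simulate(circuit, bits):
    for gate in circuit:
        bits = apply_gate(bits, gate)
    return tuple(bits)

def missing_gate_tests(circuit, n):
    """Map each gate index to the first input that exposes its removal."""
    tests = {}
    for i in range(len(circuit)):
        faulty = circuit[:i] + circuit[i + 1:]  # circuit with gate i missing
        for x in product((0, 1), repeat=n):
            if simulate(circuit, x) != simulate(faulty, x):
                tests[i] = x
                break
    return tests

# Illustrative 3-line circuit: a Toffoli gate followed by a CNOT.
circuit = [((0, 1), 2), ((0,), 1)]
print(missing_gate_tests(circuit, 3))  # {0: (1, 1, 0), 1: (1, 0, 0)}
```

Because every gate in such a circuit is a non-trivial bijection, removing any single
gate always changes the function, so a detecting input exists for each gate.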

Error-detection and fault-tolerance techniques are motivated for circuit tech- 
nologies that are likely to experience transient faults, which is the case with quantum 
circuits. Like circuit test, these techniques are heavily dependent on fault models and 
measurement, and naive attempts to model quantum faults and error-detection by 
Boolean techniques lead to nonsensical results. Fault-tolerant quantum computing is 
an extensively developed branch of quantum information processing, and its basics are 
introduced in standard textbooks [Nielsen and Chuang 2000]. 

Quantum-logic synthesis deals with general unitary matrices and is more
challenging than reversible-logic synthesis. As of 2011, the most compact circuit
constructions use $\frac{23}{48}4^n - \frac{3}{2}2^n + \frac{4}{3}$ CNOTs [Shende et al. 2006;
Mottonen and Vartiainen 2006] and $\frac{1}{2}4^n + \frac{1}{2}2^n - n - 1$ one-qubit gates
[Bergholm et al. 2005]. The sharpest lower bound on the number of CNOT gates is
$\lceil \frac{1}{4}(4^n - 3n - 1) \rceil$ [Shende et al. 2004]. Different
trade-offs between the number of one-qubit gates and CNOTs are explored in [Saeedi
et al. 2011a]. Future research directions include simultaneous reduction of CNOT and
one-qubit gates [Shende and Markov 2009], sharper lower bounds on the number of
one-qubit and CNOT gates, and consideration of circuit depth, perhaps with ancillae.

Physical layout and optimization of quantum circuits are crucial to map cir- 
cuit qubits into physical qubits. Currently, the layout of quantum circuits is hand- 
optimized when preparing laboratory experiments, but automated techniques are re- 
quired to systematically accomplish this task as the capacity of quantum computers 
increases. In [Maslov et al. 2008b], the authors proposed a heuristic for the placement 
problem by optimizing qubit-to-qubit interaction and showed that the problem of map- 
ping circuit qubits to physical qubits is NP-complete. 
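
The placement problem can be made concrete with a small sketch. It is not the
heuristic of [Maslov et al. 2008b]; it assumes a hypothetical linear chain of physical
qubits, measures cost as the total distance between the two qubits of each two-qubit
gate, and searches all placements by brute force, which is practical only for a handful
of qubits given the hardness of the general problem.

```python
from itertools import permutations

def placement_cost(gates, placement):
    """Sum of chain distances; placement[q] is the position of circuit qubit q."""
    return sum(abs(placement[a] - placement[b]) for a, b in gates)

def best_linear_placement(gates, n):
    """Exhaustive search over all n! placements on a linear chain."""
    best = min(permutations(range(n)),
               key=lambda p: placement_cost(gates, p))
    return best, placement_cost(gates, best)

# Illustrative circuit: qubit 2 interacts twice with qubit 0 and once with qubit 1.
gates = [(0, 2), (0, 2), (1, 2)]

print(placement_cost(gates, (0, 1, 2)))  # 5
print(best_linear_placement(gates, 3))   # ((0, 2, 1), 3)
```

The best placement puts the most frequently interacting qubit in the middle of the
chain, so every gate acts on adjacent physical qubits.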

Physical implementation of reversible circuits using switching devices with
limited or no gain may generate new applications. Aside from quantum circuits,
interesting examples include implementations in CMOS powered by circuit inputs [Desoete
and De Vos 2002; De Vos 2010a; Skoneczny et al. 2008] as well as photonic circuits
[Politi et al. 2009; Gao et al. 2010].

Footnote: In ion-trap and NMR quantum computers, gates are effected by RF pulses. If the
wavelength drifts too far from the desired value, the gate will not be applied. This
situation illustrates stuck-at-0 faults on controls of CNOT and Toffoli gates, but
stuck-at faults on bit lines would imply the loss of reversibility.

Design and verification tools for reversible and quantum circuits have been de- 
veloped and reported by a number of groups, but in most cases they are point tools 
built to demonstrate specific algorithms. In contrast, conventional circuit-design envi- 
ronments employ long chains of inter-operating software tools. Such powerful software 
may be necessary to scale reversible and quantum circuit design beyond its current 
limitations [Svore et al. 2006; Wille and Drechsler 2010]. On the other hand, there is 
danger of developing CAD tools that are not fully motivated by applications.